Preface


“There’s gold in them thar hills!” 
* Source unknown, frequently misattributed to Mark Twain. 


Welcome to Python for Programmers! In this book, you'll learn hands-on with today’s most
compelling, leading-edge computing technologies, and you'll program in Python, one of the
world’s most popular languages and the fastest growing among them.


Developers often quickly discover that they like Python. They appreciate its expressive power,
readability, conciseness and interactivity. They like the world of open-source software
development that’s generating a rapidly growing base of reusable software for an enormous
range of application areas.


For many decades, some powerful trends have been in place. Computer hardware has rapidly
been getting faster, cheaper and smaller. Internet bandwidth has rapidly been getting larger
and cheaper. And quality computer software has become ever more abundant and essentially
free or nearly free through the “open source” movement. Soon, the “Internet of Things” will
connect tens of billions of devices of every imaginable type. These will generate enormous
volumes of data at rapidly increasing speeds and quantities.


In computing today, the latest innovations are “all about the data”: data science, data
analytics, big data, relational databases (SQL), and NoSQL and NewSQL databases, each of
which we address along with an innovative treatment of Python programming.


JOBS REQUIRING DATA SCIENCE SKILLS 


In 2011, McKinsey Global Institute produced their report, “Big data: The next frontier for
innovation, competition and productivity.” In it, they said, “The United States alone faces a
shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million
managers and analysts to analyze big data and make decisions based on their findings.”²
This continues to be the case. The August 2018 “LinkedIn Workforce Report” says the United
States has a shortage of over 150,000 people with data science skills.³ A 2017 report from
IBM, Burning Glass Technologies and the Business-Higher Education Forum says that by
2020 in the United States there will be hundreds of thousands of new jobs requiring data
science skills.⁴


2 https://www.mckinsey.com/~/media/McKinsey/Business%20Functions/McKinsey%20Digital/Our%201 (page 3).
3 https://economicgraph.linkedin.com/resources/linkedin-workforce-report-august-2018.
4 https://www.burning-glass.com/wp-content/uploads/The_Quant_Crunch.pdf (page 3).


MODULAR ARCHITECTURE 


The book’s modular architecture (please see the Table of Contents graphic on the
book’s inside front cover) helps us meet the diverse needs of various professional audiences.


Chapters 1–10 cover Python programming. These chapters each include a brief Intro to
Data Science section introducing artificial intelligence, basic descriptive statistics,
measures of central tendency and dispersion, simulation, static and dynamic visualization,
working with CSV files, pandas for data exploration and data wrangling, time series and
simple linear regression. These help you prepare for the data science, AI, big data and cloud
case studies in Chapters 11–16, which present opportunities for you to use real-world
datasets in complete case studies.


After covering Python Chapters 1–5 and a few key parts of Chapters 6–7, you'll be able to
handle significant portions of the case studies in Chapters 11–16. The “Chapter
Dependencies” section of this Preface will help trainers plan their professional courses in the
context of the book’s unique architecture.


Chapters 11–16 are loaded with cool, powerful, contemporary examples. They present
hands-on implementation case studies on topics such as natural language processing, data
mining Twitter, cognitive computing with IBM’s Watson, supervised machine
learning with classification and regression, unsupervised machine learning with
clustering, deep learning with convolutional neural networks, deep learning
with recurrent neural networks, big data with Hadoop, Spark and NoSQL
databases, the Internet of Things and more. Along the way, you'll acquire a broad
literacy of data science terms and concepts, ranging from brief definitions to using concepts
in small, medium and large programs. Browsing the book’s detailed Table of Contents and
Index will give you a sense of the breadth of coverage.


KEY FEATURES 


KIS (Keep It Simple), KIS (Keep it Small), KIT (Keep it Topical) 


• Keep it simple: In every aspect of the book, we strive for simplicity and clarity. For
example, when we present natural language processing, we use the simple and intuitive
TextBlob library rather than the more complex NLTK. In our deep learning
presentation, we prefer Keras to TensorFlow. In general, when multiple libraries could
be used to perform similar tasks, we use the simplest one.

• Keep it small: Most of the book’s 538 examples are small, often just a few lines of
code, with immediate interactive IPython feedback. We also include 40 larger scripts
and in-depth case studies.

• Keep it topical: We read scores of recent Python-programming and data science books,
and browsed, read or watched about 15,000 current articles, research papers, white
papers, videos, blog posts, forum posts and documentation pieces. This enabled us to
“take the pulse” of the Python, computer science, data science, AI, big data and cloud
communities.


Immediate-Feedback: Exploring, Discovering and Experimenting with IPython 


• The ideal way to learn from this book is to read it and run the code examples in parallel.
Throughout the book, we use the IPython interpreter, which provides a friendly,
immediate-feedback interactive mode for quickly exploring, discovering and
experimenting with Python and its extensive libraries.

• Most of the code is presented in small, interactive IPython sessions. For each code
snippet you write, IPython immediately reads it, evaluates it and prints the results. This
instant feedback keeps your attention, boosts learning, facilitates rapid prototyping
and speeds the software-development process.

• Our books always emphasize the live-code approach, focusing on complete, working
programs with live inputs and outputs. IPython’s “magic” is that it turns even snippets
into code that “comes alive” as you enter each line. This promotes learning and
encourages experimentation.


Python Programming Fundamentals 
• First and foremost, this book provides rich Python coverage.

• We discuss Python’s programming models: procedural programming, functional-style
programming and object-oriented programming.

• We use best practices, emphasizing current idiom.

• Functional-style programming is used throughout the book as appropriate. A chart
in Chapter 4 lists most of Python’s key functional-style programming capabilities and the
chapters in which we initially cover most of them.
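To make the contrast between the procedural and functional styles concrete, here is a minimal sketch. The data and the names `temperatures` and `to_celsius` are invented for illustration and are not taken from the book's examples.

```python
# Hypothetical data: Fahrenheit readings invented for this sketch.
temperatures = [41, 50, 68, 86, 104]


def to_celsius(fahrenheit):
    """Convert a Fahrenheit temperature to Celsius."""
    return (fahrenheit - 32) * 5 / 9


# Procedural style: an explicit loop that mutates an accumulator list.
warm_loop = []
for reading in temperatures:
    celsius = to_celsius(reading)
    if celsius >= 20:
        warm_loop.append(celsius)

# Functional style: the built-in map and filter functions are lazy,
# so list() is needed to force evaluation.
warm_filter_map = list(filter(lambda c: c >= 20, map(to_celsius, temperatures)))

# A list comprehension expresses the same mapping and filtering in one expression.
warm_comprehension = [c for c in map(to_celsius, temperatures) if c >= 20]

# A generator expression feeds sum() without building an intermediate list.
total_warm = sum(c for c in map(to_celsius, temperatures) if c >= 20)

print(warm_loop, warm_filter_map, warm_comprehension, total_warm)
```

All three approaches produce the same list. The functional versions say what to compute and leave the iteration details to Python.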


538 Code Examples 


• You'll get an engaging, challenging and entertaining introduction to Python with 538
real-world examples ranging from individual snippets to substantial computer
science, data science, artificial intelligence and big data case studies.

• You'll attack significant tasks with AI, big data and cloud technologies like natural
language processing, data mining Twitter, machine learning, deep learning,
Hadoop, MapReduce, Spark, IBM Watson, key data science libraries (NumPy,
pandas, SciPy, NLTK, TextBlob, spaCy, Textatistic, Tweepy, Scikit-learn,
Keras), key visualization libraries (Matplotlib, Seaborn, Folium) and more.


Avoid Heavy Math in Favor of English Explanations 


• We capture the conceptual essence of the mathematics and put it to work in our
examples. We do this by using libraries such as statistics, NumPy, SciPy, pandas and
many others, which hide the mathematical complexity. So, it’s straightforward for you to
get many of the benefits of mathematical techniques like linear regression without
having to know the mathematics behind them. In the machine-learning and
deep-learning examples, we focus on creating objects that do the math for you “behind the
scenes.”
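As a small illustration of letting a library hide the math, the sketch below uses the Python Standard Library's `statistics` module to compute measures of central tendency and dispersion. The `grades` values are invented for this example.

```python
import statistics

# Hypothetical sample invented for this sketch.
grades = [85, 93, 45, 89, 85]

mean = statistics.mean(grades)      # arithmetic average
median = statistics.median(grades)  # middle value of the sorted data
mode = statistics.mode(grades)      # most frequently occurring value
spread = statistics.pstdev(grades)  # population standard deviation

print(f"mean={mean}, median={median}, mode={mode}, pstdev={spread:.2f}")
```

No formulas appear in the code. Each function call stands in for the calculation it names.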


Visualizations 


• 67 static, dynamic, animated and interactive visualizations (charts, graphs,
pictures, animations etc.) help you understand concepts.

• Rather than including a treatment of low-level graphics programming, we focus on
high-level visualizations produced by Matplotlib, Seaborn, pandas and Folium (for
interactive maps).

• We use visualizations as a pedagogic tool. For example, we make the law of large
numbers “come alive” in a dynamic die-rolling simulation and bar chart. As the
number of rolls increases, you'll see each face’s percentage of the total rolls gradually
approach 16.667% (1/6th) and the sizes of the bars representing the percentages equalize.


• Visualizations are crucial in big data for data exploration and communicating
reproducible research results, where the data items can number in the millions,
billions or more. A common saying is that a picture is worth a thousand words;⁵ in big
data, a visualization could be worth billions, trillions or even more items in a database.
Visualizations enable you to “fly 40,000 feet above the data” to see it “in the large” and to
get to know your data. Descriptive statistics help but can be misleading. For example,
Anscombe’s quartet⁶ demonstrates through visualizations that significantly different
datasets can have nearly identical descriptive statistics.

5 https://en.wikipedia.org/wiki/A_picture_is_worth_a_thousand_words.
6 https://en.wikipedia.org/wiki/Anscombe%27s_quartet.

• We show the visualization and animation code so you can implement your own. We also
provide the animations in source-code files and as Jupyter Notebooks, so you can
conveniently customize the code and animation parameters, re-execute the animations
and see the effects of the changes.
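The law-of-large-numbers behavior described in the die-rolling bullet above can be sketched without any graphics. This is a text-only approximation, not the book's animated bar chart. The function name `face_percentages` and the `seed` parameter are assumptions made for this example.

```python
import random
from collections import Counter


def face_percentages(rolls, seed=None):
    """Roll a six-sided die `rolls` times and return each face's share in percent."""
    rng = random.Random(seed)  # a private generator keeps runs repeatable
    counts = Counter(rng.randrange(1, 7) for _ in range(rolls))
    return {face: 100 * counts[face] / rolls for face in range(1, 7)}


# As the number of rolls grows, every percentage drifts toward 100/6 (about 16.667%).
for rolls in (60, 6_000, 600_000):
    percentages = face_percentages(rolls, seed=1)
    summary = "  ".join(f"{face}: {pct:6.3f}%" for face, pct in percentages.items())
    print(f"{rolls:>7} rolls  {summary}")
```

Changing the roll counts or the seed and rerunning shows how quickly the percentages even out.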


Data Experiences 


• Our Intro to Data Science sections and case studies in Chapters 11–16 provide rich
data experiences.

• You'll work with many real-world datasets and data sources. There’s an enormous
variety of free open datasets available online for you to experiment with. Some of the
sites we reference list hundreds or thousands of datasets.

• Many libraries you'll use come bundled with popular datasets for experimentation.

• You'll learn the steps required to obtain data and prepare it for analysis, analyze that data
using many techniques, tune your models and communicate your results effectively,
especially through visualization.


GitHub 


• GitHub is an excellent venue for finding open-source code to incorporate into your
projects (and to contribute your code to the open-source community). It’s also a crucial
element of the software developer’s arsenal with version control tools that help teams of
developers manage open-source (and private) projects.

• You'll use an extraordinary range of free and open-source Python and data science
libraries, and free, free-trial and freemium offerings of software and cloud services.
Many of the libraries are hosted on GitHub.


Hands-On Cloud Computing 


• Much of big data analytics occurs in the cloud, where it’s easy to scale dynamically the
amount of hardware and software your applications need. You'll work with various
cloud-based services (some directly and some indirectly), including Twitter, Google
Translate, IBM Watson, Microsoft Azure, OpenMapQuest, geopy, Dweet.io and
PubNub.

• We encourage you to use free, free-trial or freemium cloud services. We prefer those that
don’t require a credit card because you don’t want to risk accidentally running up big bills.
If you decide to use a service that requires a credit card, ensure that the tier
you’re using for free will not automatically jump to a paid tier.


Database, Big Data and Big Data Infrastructure 


• According to IBM (Nov. 2016), 90% of the world’s data was created in the last two years.⁷
Evidence indicates that the speed of data creation is rapidly accelerating.

7 https://public.dhe.ibm.com/common/ssi/ecm/wr/en/wrl12345usen/watson-customer-engagement--watson-marketing-wr-other-papers-and-reports-wrl12345usen-20170719.pdf.

• According to a March 2016 AnalyticsWeek article, within five years there will be over 50
billion devices connected to the Internet and by 2020 we'll be producing 1.7 megabytes of
new data every second for every person on the planet!⁸

8 https://analyticsweek.com/content/big-data-facts/.


• We include a treatment of relational databases and SQL with SQLite.

• Databases are critical big data infrastructure for storing and manipulating the
massive amounts of data you'll process. Relational databases process structured data;
they’re not geared to the unstructured and semi-structured data in big data applications.
So, as big data evolved, NoSQL and NewSQL databases were created to handle such
data efficiently. We include a NoSQL and NewSQL overview and a hands-on case study
with a MongoDB JSON document database. MongoDB is the most popular NoSQL
database.

• We discuss big data hardware and software infrastructure in Chapter 16, “Big
Data: Hadoop, Spark, NoSQL and IoT (Internet of Things).”
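To give a feel for working with a relational database from Python, here is a minimal sketch using the Standard Library's `sqlite3` module with an in-memory database. The `authors` table and its rows are invented for this example and are not the book's sample database.

```python
import sqlite3

# ":memory:" creates a temporary database that disappears when the connection closes.
connection = sqlite3.connect(":memory:")

# Hypothetical schema and rows invented for this sketch.
connection.execute(
    "CREATE TABLE authors (id INTEGER PRIMARY KEY, first TEXT, last TEXT)"
)
connection.executemany(
    "INSERT INTO authors (first, last) VALUES (?, ?)",
    [("Ada", "Lovelace"), ("Alan", "Turing"), ("Grace", "Hopper")],
)

# The ? placeholder passes values separately from the SQL text.
rows = connection.execute(
    "SELECT first, last FROM authors WHERE last LIKE ? ORDER BY last",
    ("%o%",),
).fetchall()
connection.close()

print(rows)
```

The query returns only the authors whose last names contain the letter "o", sorted by last name. That kind of structured, tabular access is what relational databases are designed for.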


Artificial Intelligence Case Studies 


• In case study Chapters 11–15, we present artificial intelligence topics, including
natural language processing, data mining Twitter to perform sentiment
analysis, cognitive computing with IBM Watson, supervised machine
learning, unsupervised machine learning and deep learning. Chapter 16 presents
the big data hardware and software infrastructure that enables computer scientists and
data scientists to implement leading-edge AI-based solutions.


Built-In Collections: Lists, Tuples, Sets, Dictionaries 


• There’s little reason today for most application developers to build custom data
structures. The book features a rich two-chapter treatment of Python’s built-in
data structures (lists, tuples, dictionaries and sets) with which most
data-structuring tasks can be accomplished.
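The following sketch shows the four built-in collections side by side. The sample data is invented for illustration.

```python
# Hypothetical data invented for this sketch.
words = ["to", "be", "or", "not", "to", "be"]  # list: ordered and mutable
point = (3, 4)                                 # tuple: ordered and immutable
unique_words = set(words)                      # set: duplicates are discarded

# dict: maps each unique word (key) to the number of times it occurs (value).
word_counts = {}
for word in words:
    word_counts[word] = word_counts.get(word, 0) + 1

x, y = point  # tuple unpacking assigns each element to its own variable

print(unique_words, word_counts, x, y)
```

Counting words took a dictionary and a three-line loop, with no custom data structure.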


Array-Oriented Programming with NumPy Arrays and Pandas 
Series/DataFrames 


• We also focus on three key data structures from open-source libraries: NumPy arrays,
pandas Series and pandas DataFrames. These are used extensively in data science,
computer science, artificial intelligence and big data. NumPy offers as much as two orders
of magnitude higher performance than built-in Python lists.

• Chapter 7 includes a rich treatment of NumPy arrays. Many libraries, such as
pandas, are built on NumPy. The Intro to Data Science sections in Chapters 7–9
introduce pandas Series and DataFrames, which along with NumPy arrays are then
used throughout the remaining chapters.


File Processing and Serialization 


• Chapter 9 presents text-file processing, then demonstrates how to serialize objects
using the popular JSON (JavaScript Object Notation) format. JSON is used
frequently in the data science chapters.

• Many data science libraries provide built-in file-processing capabilities for loading
datasets into your Python programs. In addition to plain text files, we process files in the
popular CSV (comma-separated values) format using the Python Standard
Library’s csv module and capabilities of the pandas data science library.
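Here is a minimal sketch of both formats using only the Standard Library's `json` and `csv` modules. It writes to an in-memory `io.StringIO` buffer instead of a real file so that it runs anywhere. The account records are invented for this example.

```python
import csv
import io
import json

# Hypothetical records invented for this sketch.
accounts = {
    "accounts": [
        {"account": 100, "name": "Jones", "balance": 24.98},
        {"account": 200, "name": "Doe", "balance": 345.67},
    ]
}

# JSON: serialize the dictionary to text, then rebuild an equal object from that text.
json_text = json.dumps(accounts, indent=2)
restored = json.loads(json_text)

# CSV: write a header row and one row per record into an in-memory "file".
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["account", "name", "balance"])
for record in accounts["accounts"]:
    writer.writerow([record["account"], record["name"], record["balance"]])

# Read the rows back as dictionaries keyed by the header names.
buffer.seek(0)
rows = list(csv.DictReader(buffer))

print(restored == accounts)
print(rows)
```

JSON preserves the number types on the round trip, whereas the csv module returns every field as a string, so numeric columns must be converted after reading.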


Object-Based Programming 


• We emphasize using the huge number of valuable classes that the Python open-source
community has packaged into industry standard class libraries. You'll focus on knowing
what libraries are out there, choosing the ones you'll need for your apps, creating objects
from existing classes (usually in one or two lines of code) and making them “jump, dance
and sing.” This object-based programming enables you to build impressive
applications quickly and concisely, which is a significant part of Python’s appeal.

• With this approach, you'll be able to use machine learning, deep learning and other AI
technologies to quickly solve a wide range of intriguing problems, including cognitive
computing challenges like speech recognition and computer vision.


Object-Oriented Programming 


• Developing custom classes is a crucial object-oriented programming skill, along
with inheritance, polymorphism and duck typing. We discuss these in Chapter 10.

• Chapter 10 includes a discussion of unit testing with doctest and a fun
card-shuffling-and-dealing simulation.

• Chapters 11–16 require only a few straightforward custom class definitions. In Python,
you'll probably use more of an object-based programming approach than full-out
object-oriented programming.
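Two of the ideas named above, duck typing and unit testing with doctest, fit in one short sketch. The classes `Duck` and `Robot` and the function `chorus` are hypothetical names invented for this example.

```python
import doctest


class Duck:
    def speak(self):
        return "Quack"


class Robot:
    def speak(self):
        return "Beep"


def chorus(speakers):
    """Return what each object says, separated by spaces.

    Duck typing: any object with a speak() method is accepted,
    whether or not the classes are related by inheritance.

    >>> chorus([Duck(), Robot()])
    'Quack Beep'
    """
    return " ".join(speaker.speak() for speaker in speakers)


if __name__ == "__main__":
    # Runs the examples embedded in the docstrings and reports any mismatches.
    doctest.testmod()
```

When the script is run directly, `doctest.testmod()` prints nothing if every docstring example produces the output shown and prints a failure report otherwise.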


Reproducibility 


• In the sciences in general, and data science in particular, there’s a need to reproduce the
results of experiments and studies, and to communicate those results effectively. Jupyter
Notebooks are a preferred means for doing this.

• We discuss reproducibility throughout the book in the context of programming
techniques and software such as Jupyter Notebooks and Docker.


Performance 


• We use the %timeit profiling tool in several examples to compare the performance of
different approaches to performing the same tasks. Other performance-related
discussions include generator expressions, NumPy arrays vs. Python lists, performance of
machine-learning and deep-learning models, and Hadoop and Spark
distributed-computing performance.
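`%timeit` is an IPython magic command, so it works only inside IPython or Jupyter. As a rough stand-in that also runs as a plain script, the sketch below uses the Standard Library's `timeit` module to compare a list comprehension with a generator expression. The function names and the workload size are assumptions chosen for illustration.

```python
import timeit


def total_with_list(n):
    """Sum the squares of 0..n-1 by first building a full list."""
    return sum([x * x for x in range(n)])


def total_with_generator(n):
    """Sum the squares of 0..n-1 lazily, one value at a time."""
    return sum(x * x for x in range(n))


# Each timing is the total number of seconds for 200 calls.
list_seconds = timeit.timeit(lambda: total_with_list(10_000), number=200)
generator_seconds = timeit.timeit(lambda: total_with_generator(10_000), number=200)

print(f"list comprehension:   {list_seconds:.4f} s")
print(f"generator expression: {generator_seconds:.4f} s")
```

The timings vary by machine and Python version, so treat them as relative measurements. The generator version avoids holding the whole list in memory, which matters more as `n` grows.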


Big Data and Parallelism 


• In this book, rather than writing your own parallelization code, you'll let libraries like
Keras running over TensorFlow, and big data tools like Hadoop and Spark parallelize
operations for you. In this big data/AI era, the sheer processing requirements of massive
data applications demand taking advantage of true parallelism provided by multicore
processors, graphics processing units (GPUs), tensor processing units (TPUs)
and huge clusters of computers in the cloud. Some big data tasks could have
thousands of processors working in parallel to analyze massive amounts of data
expeditiously.


CHAPTER DEPENDENCIES 


If you're a trainer planning your syllabus for a professional training course or a developer
deciding which chapters to read, this section will help you make the best decisions. Please
read the one-page color Table of Contents on the book’s inside front cover. This will
quickly familiarize you with the book’s unique architecture. Teaching or reading the chapters
in order is easiest. However, much of the content in the Intro to Data Science sections at the
ends of Chapters 1–10 and the case studies in Chapters 11–16 requires only Chapters 1–5
and small portions of Chapters 6–10 as discussed below.


Part 1: Python Fundamentals Quickstart 


We recommend that you read all the chapters in order: 


• Chapter 1, Introduction to Computers and Python, introduces concepts that lay
the groundwork for the Python programming in Chapters 2–10 and the big data,
artificial-intelligence and cloud-based case studies in Chapters 11–16. The chapter also
includes test-drives of the IPython interpreter and Jupyter Notebooks.

• Chapter 2, Introduction to Python Programming, presents Python programming
fundamentals with code examples illustrating key language features.

• Chapter 3, Control Statements, presents Python’s control statements and
introduces basic list processing.

• Chapter 4, Functions, introduces custom functions, presents simulation
techniques with random-number generation and introduces tuple
fundamentals.

• Chapter 5, Sequences: Lists and Tuples, presents Python’s built-in list and tuple
collections in more detail and begins introducing functional-style programming.


Part 2: Python Data Structures, Strings and Files
The following summarizes inter-chapter dependencies for Python Chapters 6–9 and
assumes that you’ve read Chapters 1–5.

• Chapter 6, Dictionaries and Sets. The Intro to Data Science section in this chapter is
not dependent on the chapter’s contents.

• Chapter 7, Array-Oriented Programming with NumPy. The Intro to Data Science
section requires dictionaries (Chapter 6) and arrays (Chapter 7).

• Chapter 8, Strings: A Deeper Look. The Intro to Data Science section requires raw
strings and regular expressions (Sections 8.11–8.12), and pandas Series and
DataFrame features from Section 7.14’s Intro to Data Science.

• Chapter 9, Files and Exceptions. For JSON serialization, it’s useful to understand
dictionary fundamentals (Section 6.2). Also, the Intro to Data Science section requires the
built-in open function and the with statement (Section 9.3), and pandas DataFrame
features from Section 7.14’s Intro to Data Science.


Part 3: Python High-End Topics 
The following summarizes inter-chapter dependencies for Python Chapter 10 and assumes
that you’ve read Chapters 1–5.

• Chapter 10, Object-Oriented Programming. The Intro to Data Science section
requires pandas DataFrame features from Section 7.14’s Intro to Data Science. Trainers
wanting to cover only classes and objects can present Sections 10.1–10.6. Trainers
wanting to cover more advanced topics like inheritance, polymorphism and duck
typing can present Sections 10.7–10.9. Sections 10.10–10.15 provide additional
advanced perspectives.


Part 4: AI, Cloud and Big Data Case Studies
The following summary of inter-chapter dependencies for Chapters 11–16 assumes that
you've read Chapters 1–5. Most of Chapters 11–16 also require dictionary fundamentals
from Section 6.2.


• Chapter 11, Natural Language Processing (NLP), uses pandas DataFrame features
from Section 7.14’s Intro to Data Science.

• Chapter 12, Data Mining Twitter, uses pandas DataFrame features from Section
7.14’s Intro to Data Science, string method join (Section 8.9), JSON fundamentals
(Section 9.5), TextBlob (Section 11.2) and Word clouds (Section 11.3). Several examples
require defining a class via inheritance (Chapter 10).

• Chapter 13, IBM Watson and Cognitive Computing, uses built-in function open
and the with statement (Section 9.3).

• Chapter 14, Machine Learning: Classification, Regression and Clustering, uses
NumPy array fundamentals and method unique (Chapter 7), pandas DataFrame
features from Section 7.14’s Intro to Data Science and Matplotlib function subplots
(Section 10.6).

• Chapter 15, Deep Learning, requires NumPy array fundamentals (Chapter 7), string
method join (Section 8.9), general machine-learning concepts from Chapter 14 and
features from Chapter 14’s Case Study: Classification with k-Nearest Neighbors and the
Digits Dataset.


• Chapter 16, Big Data: Hadoop, Spark, NoSQL and IoT, uses string method split
(Section 6.2.7), Matplotlib FuncAnimation from Section 6.4’s Intro to Data Science,
pandas Series and DataFrame features from Section 7.14’s Intro to Data Science, string
method join (Section 8.9), the json module (Section 9.5), NLTK stop words (Section
11.2.13) and from Chapter 12, Twitter authentication, Tweepy’s StreamListener class
for streaming tweets, and the geopy and folium libraries. A few examples require
defining a class via inheritance (Chapter 10), but you can simply mimic the class
definitions we provide without reading Chapter 10.


JUPYTER NOTEBOOKS 


For your convenience, we provide the book’s code examples in Python source code (.py)
files for use with the command-line IPython interpreter and as Jupyter Notebooks
(.ipynb) files that you can load into your web browser and execute.


Jupyter Notebooks is a free, open-source project that enables you to combine text,
graphics, audio, video, and interactive coding functionality for entering, editing, executing,
debugging, and modifying code quickly and conveniently in a web browser. According to the
article, “What Is Jupyter?”:

Jupyter has become a standard for scientific research and data analysis. It packages
computation and argument together, letting you build “computational narratives”; and it
simplifies the problem of distributing working software to teammates and associates.⁹

9 https://www.oreilly.com/ideas/what-is-jupyter.


In our experience, it’s a wonderful learning environment and rapid prototyping tool. For
this reason, we use Jupyter Notebooks rather than a traditional IDE, such as Eclipse,
Visual Studio, PyCharm or Spyder. Academics and professionals already use Jupyter
extensively for sharing research results. Jupyter Notebooks support is provided through the
traditional open-source community mechanisms¹⁰ (see “Getting Jupyter Help” later in this
Preface). See the Before You Begin section that follows this Preface for software installation
details and see the test-drives in Section 1.5 for information on running the book’s examples.

10 https://jupyter.org/community.


Collaboration and Sharing Results 


Working in teams and communicating research results are both important for developers in
or moving into data-analytics positions in industry, government or academia:

• The notebooks you create are easy to share among team members simply by copying
the files or via GitHub.

• Research results, including code and insights, can be shared as static web pages via tools
like nbviewer (https://nbviewer.jupyter.org) and GitHub, both of which automatically
render notebooks as web pages.


Reproducibility: A Strong Case for Jupyter Notebooks 


In data science, and in the sciences in general, experiments and studies should be
reproducible. This has been written about in the literature for many years, including:

• Donald Knuth’s 1992 computer science publication, Literate Programming.¹¹

11 Knuth, D., “Literate Programming” (PDF), The Computer Journal, British Computer
Society, 1992.

• The article “Language-Agnostic Reproducible Data Analysis Using Literate
Programming,”¹² which says, “Lir (literate, reproducible computing) is based on the idea
of literate programming as proposed by Donald Knuth.”

12 http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0164023.

Essentially, reproducibility captures the complete environment used to produce results:
hardware, software, communications, algorithms (especially code), data and the data’s
provenance (origin and lineage).


DOCKER 


In Chapter 16, we'll use Docker, a tool for packaging software into containers that bundle
everything required to execute that software conveniently, reproducibly and portably across
platforms. Some software packages we use in Chapter 16 require complicated setup and
configuration. For many of these, you can download free preexisting Docker containers.
These enable you to avoid complex installation issues and execute software locally on your
desktop or notebook computers, making Docker a great way to help you get started with new
technologies quickly and conveniently.

Docker also helps with reproducibility. You can create custom Docker containers that are
configured with the versions of every piece of software and every library you used in your
study. This would enable other developers to recreate the environment you used, then
reproduce your work, and will help you reproduce your own results. In Chapter 16, you'll use
Docker to download and execute a container that’s preconfigured for you to code and run big
data Spark applications using Jupyter Notebooks.


SPECIAL FEATURE: IBM WATSON ANALYTICS AND 
COGNITIVE COMPUTING 


Early in our research for this book, we recognized the rapidly growing interest in IBM’s
Watson. We investigated competitive services and found Watson’s “no credit card required”
policy for its “free tiers” to be among the most friendly for our readers.

IBM Watson is a cognitive-computing platform being employed across a wide range of
real-world scenarios. Cognitive-computing systems simulate the pattern-recognition and
decision-making capabilities of the human brain to “learn” as they consume more
data.¹³ ¹⁴ ¹⁵ We include a significant hands-on Watson treatment. We use the free Watson
Developer Cloud: Python SDK, which provides APIs that enable you to interact with
Watson’s services programmatically. Watson is fun to use and a great platform for letting
your creative juices flow. You’ll demo or use the following Watson APIs: Conversation,
Discovery, Language Translator, Natural Language Classifier, Natural Language
Understanding, Personality Insights, Speech to Text, Text to Speech, Tone
Analyzer and Visual Recognition.

13 http://whatis.techtarget.com/definition/cognitive-computing.
14 https://en.wikipedia.org/wiki/Cognitive_computing.
15 https://www.forbes.com/sites/bernardmarr/2016/03/23/what-everyone-should-know-about-cognitive-computing.


Watson’s Lite Tier Services and a Cool Watson Case Study 


IBM encourages learning and experimentation by providing free lite tiers for many of its
APIs.¹⁶ In Chapter 13, you'll try demos of many Watson services.¹⁷ Then, you'll use the lite
tiers of Watson’s Text to Speech, Speech to Text and Translate services to implement a
“traveler’s assistant” translation app. You'll speak a question in English, then the app
will transcribe your speech to English text, translate the text to Spanish and speak the
Spanish text. Next, you'll speak a Spanish response (in case you don’t speak Spanish, we
provide an audio file you can use). Then, the app will quickly transcribe the speech to Spanish
text, translate the text to English and speak the English response. Cool stuff!

16 Always check the latest terms on IBM’s website, as the terms and services may change.
17 https://console.bluemix.net/catalog/.


TEACHING APPROACH 


Python for Programmers contains a rich collection of examples drawn from many fields.
You'll work through interesting, real-world examples using real-world datasets. The book
concentrates on the principles of good software engineering and stresses program
clarity.


Using Fonts for Emphasis 


We place the key terms and the index’s page reference for each defining occurrence in bold
text for easier reference. We refer to on-screen components in the bold Helvetica font (for
example, the File menu) and use the Lucida font for Python code (for example, x = 5).


Syntax Coloring 


For readability, we syntax color all the code. Our syntax-coloring conventions are as follows: 


comments appear in green 

keywords appear in dark blue 

constants and literal values appear in light blue 
errors appear in red 


all other code appears in black 


538 Code Examples 


The book’s 538 examples contain approximately 4000 lines of code. This is a relatively
small amount for a book this size because Python is such an expressive
language. Also, our coding style is to use powerful class libraries to do most of the work
wherever possible.


160 Tables/Illustrations/Visualizations 


We include abundant tables, line drawings, and static, dynamic and interactive visualizations. 


Programming Wisdom 


We integrate into the discussions programming wisdom from the authors’ combined nine
decades of programming and teaching experience, including:

• Good programming practices and preferred Python idioms that help you produce
clearer, more understandable and more maintainable programs.

• Common programming errors to reduce the likelihood that you'll make them.

• Error-prevention tips with suggestions for exposing bugs and removing them from
your programs. Many of these tips describe techniques for preventing bugs from getting
into your programs in the first place.

• Performance tips that highlight opportunities to make your programs run faster or
minimize the amount of memory they occupy.

• Software engineering observations that highlight architectural and design issues for
proper software construction, especially for larger systems.


SOFTWARE USED IN THE BOOK 


The software we use is available for Windows, macOS and Linux and is free for download
from the Internet. We wrote the book’s examples using the free Anaconda Python
distribution. It includes most of the Python, visualization and data science libraries you'll
need, as well as the IPython interpreter, Jupyter Notebooks and Spyder, considered one of
the best Python data science IDEs. We use only IPython and Jupyter Notebooks for program
development in the book. The Before You Begin section following this Preface discusses
installing Anaconda and a few other items you'll need for working with our examples.

PYTHON DOCUMENTATION 
You'll find the following documentation especially helpful as you work through the book:

• The Python Language Reference:
https://docs.python.org/3/reference/index.html

• The Python Standard Library:
https://docs.python.org/3/library/index.html

• Python documentation list:
https://docs.python.org/3/


GETTING YOUR QUESTIONS ANSWERED 


Popular Python and general programming online forums include:

• python-forum.io

• https://www.dreamincode.net/forums/forum/29-python/

• StackOverflow.com

Also, many vendors provide forums for their tools and libraries. Many of the libraries you'll
use in this book are managed and maintained at github.com. Some library maintainers
provide support through the Issues tab on a given library’s GitHub page. If you cannot find
an answer to your questions online, please see our web page for the book at
http://www.deitel.com⁸

⁸ Our website is undergoing a major upgrade. If you do not find something you need, please
write to us directly at deitel@deitel.com.


GETTING JUPYTER HELP 


Jupyter Notebooks support is provided through:

• Project Jupyter Google Group:
https://groups.google.com/forum/#!forum/jupyter

• Jupyter real-time chat room:
https://gitter.im/jupyter/jupyter

• GitHub:
https://github.com/jupyter/help

• StackOverflow:
https://stackoverflow.com/questions/tagged/jupyter

• Jupyter for Education Google Group (for instructors teaching with Jupyter):
https://groups.google.com/forum/#!forum/jupyter-education


SUPPLEMENTS 


To get the most out of the presentation, you should execute each code example in parallel
with reading the corresponding discussion in the book. On the book’s web page at
http://www.deitel.com
we provide:

• Downloadable Python source code (.py files) and Jupyter Notebooks (.ipynb
files) for the book’s code examples.

• Getting Started videos showing how to use the code examples with IPython and
Jupyter Notebooks. We also introduce these tools in Section 1.5.

• Blog posts and book updates.

For download instructions, see the Before You Begin section that follows this Preface.


KEEPING IN TOUCH WITH THE AUTHORS 


For answers to questions or to report an error, send an e-mail to us at
deitel@deitel.com
or interact with us via social media:

• Facebook® (http://www.deitel.com/deitelfan)
• Twitter® (@deitel)
• LinkedIn® (http://linkedin.com/company/deitel-&-associates)
• YouTube® (http://youtube.com/DeitelTV)
Before You Begin


This section contains information you should review before using this book. We'll post
updates at: http://www.deitel.com.


FONT AND NAMING CONVENTIONS 


We show Python code and commands and file and folder names in a sans-serif
font, and on-screen components, such as menu names, in a bold sans-serif font.


We use italics for emphasis and bold occasionally for strong emphasis. 


GETTING THE CODE EXAMPLES 


You can download the examples.zip file containing the book’s examples from our
Python for Programmers web page at:
http://www.deitel.com


Click the Download Examples link to save the file to your local computer. Most web
browsers place the file in your user account’s Downloads folder. When the download
completes, locate it on your system, and extract its examples folder into your user
account’s Documents folder:

• Windows: C:\Users\YourAccount\Documents\examples

• macOS or Linux: ~/Documents/examples

Most operating systems have a built-in extraction tool. You also may use an archive tool
such as 7-Zip (www.7-zip.org) or WinZip (www.winzip.com).


STRUCTURE OF THE EXAMPLES FOLDER 


You'll execute three kinds of examples in this book:

• Individual code snippets in the IPython interactive environment.

• Complete applications, which are known as scripts.

• Jupyter Notebooks—a convenient interactive, web-browser-based environment in
which you can write and execute code and intermix the code with text, images and
video.

We demonstrate each in Section 1.5’s test drives.

The examples folder contains one subfolder per chapter. These are named ch##,
where ## is the two-digit chapter number 01 to 16—for example, ch01. Except for
Chapters 13, 15 and 16, each chapter’s folder contains the following items:


• snippets_ipynb—A folder containing the chapter’s Jupyter Notebook files.

• snippets_py—A folder containing Python source code files in which each code
snippet we present is separated from the next by a blank line. You can copy and
paste these snippets into IPython or into new Jupyter Notebooks that you create.

• Script files and their supporting files.

Chapter 13 contains one application. Chapters 15 and 16 explain where to find the files
you need in the ch15 and ch16 folders, respectively.
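The ch## naming scheme described above can be sketched in a few lines of Python. This is only an illustration: it assumes you extracted the examples folder into your Documents folder, and the names EXAMPLES and chapter_folder are hypothetical helpers of ours, not part of the book's code.

```python
from pathlib import Path

# Assumption: the examples folder was extracted into the user's Documents folder.
EXAMPLES = Path.home() / 'Documents' / 'examples'


def chapter_folder(chapter):
    """Return the path of a chapter's subfolder, such as ch01 or ch16."""
    if not 1 <= chapter <= 16:
        raise ValueError('chapter must be in the range 1-16')
    return EXAMPLES / f'ch{chapter:02d}'  # :02d pads to a two-digit number


# Folder names for every chapter: ch01, ch02, ..., ch16
names = [chapter_folder(n).name for n in range(1, 17)]
print(names[0], names[-1])  # ch01 ch16
```

The format specifier 02d supplies the leading zero, so the folder names sort in chapter order.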


INSTALLING ANACONDA 


We use the easy-to-install Anaconda Python distribution with this book. It comes with
almost everything you'll need to work with our examples, including:

• the IPython interpreter,

• most of the Python and data science libraries we use,

• a local Jupyter Notebooks server so you can load and execute our notebooks, and

• various other software packages, such as the Spyder Integrated Development
Environment (IDE)—we use only IPython and Jupyter Notebooks in this book.


Download the Python 3.x Anaconda installer for Windows, macOS or Linux from: 


https://www.anaconda.com/download/


When the download completes, run the installer and follow the on-screen instructions. 


To ensure that Anaconda runs correctly, do not move its files after you install it. 


UPDATING ANACONDA 


Next, ensure that Anaconda is up to date. Open a command-line window on your
system as follows:

• On macOS, open a Terminal from the Applications folder’s Utilities subfolder.

• On Windows, open the Anaconda Prompt from the start menu. When doing this
to update Anaconda (as you'll do here) or to install new packages (discussed
momentarily), execute the Anaconda Prompt as an administrator by right-clicking,
then selecting More > Run as administrator. (If you cannot find the
Anaconda Prompt in the start menu, simply search for it in the Type here to
search field at the bottom of your screen.)

• On Linux, open your system’s Terminal or shell (this varies by Linux distribution).

In your system’s command-line window, execute the following commands to update
Anaconda’s installed packages to their latest versions:


1. conda update conda 


2. conda update --all 


PACKAGE MANAGERS 


The conda command used above invokes the conda package manager—one of the
two key Python package managers you'll use in this book. The other is pip. Packages
contain the files required to install a given Python library or tool. Throughout the book,
you'll use conda to install additional packages, unless those packages are not available
through conda, in which case you'll use pip. Some people prefer to use pip exclusively
as it currently supports more packages. If you ever have trouble installing a package
with conda, try pip instead.
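Before installing a package with either tool, you may want to check whether Python can already find it. The following is a minimal sketch using the Python Standard Library's importlib.util.find_spec function. The helper name is_installed is ours, and the second module name is deliberately made up. Keep in mind that a package's import name can differ from the name you install it under (for example, scikit-learn is imported as sklearn).

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


print(is_installed('json'))  # True: json is part of the Python Standard Library
print(is_installed('a_module_that_does_not_exist'))  # hypothetical name
```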


INSTALLING THE PROSPECTOR STATIC CODE 
ANALYSIS TOOL 


You may want to analyze your Python code using the Prospector analysis tool, which
checks your code for common errors and helps you improve it. To install Prospector
and the Python libraries it uses, run the following command in the command-line
window:


pip install prospector 


INSTALLING JUPYTER-MATPLOTLIB 


We implement several animations using a visualization library called Matplotlib. To use
them in Jupyter Notebooks, you must install a tool called ipympl. In the Terminal,
Anaconda Command Prompt or shell you opened previously, execute the following
commands* one at a time:

conda install -c conda-forge ipympl
conda install nodejs
jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install jupyter-matplotlib

* https://github.com/matplotlib/jupyter-matplotlib.


INSTALLING THE OTHER PACKAGES 


Anaconda comes with approximately 300 popular Python and data science packages for
you, such as NumPy, Matplotlib, pandas, Regex, BeautifulSoup, requests, Bokeh, SciPy,
SciKit-Learn, Seaborn, Spacy, sqlite, statsmodels and many more. The number of
additional packages you'll need to install throughout the book will be small and we'll
provide installation instructions as necessary. As you discover new packages, their
documentation will explain how to install them.


GET A TWITTER DEVELOPER ACCOUNT 


If you intend to use our “Data Mining Twitter” chapter and any Twitter-based examples
in subsequent chapters, apply for a Twitter developer account. Twitter now requires
registration for access to their APIs. To apply, fill out and submit the application at

https://developer.twitter.com/en/apply-for-access

Twitter reviews every application. At the time of this writing, personal developer
accounts were being approved immediately and company-account applications were
taking from several days to several weeks. Approval is not guaranteed.


INTERNET CONNECTION REQUIRED IN SOME 
CHAPTERS 


While using this book, you’ll need an Internet connection to install various additional 
Python libraries. In some chapters, you'll register for accounts with cloud-based 
services, mostly to use their free tiers. Some services require credit cards to verify your 
identity. In a few cases, you'll use services that are not free. In these cases, you'll take 
advantage of monetary credits provided by the vendors so you can try their services 
without incurring charges. Caution: Some cloud-based services incur costs 
once you set them up. When you complete our case studies using such 


services, be sure to promptly delete the resources you allocated. 


SLIGHT DIFFERENCES IN PROGRAM OUTPUTS 


When you execute our examples, you might notice some differences between the results
we show and your own results:

• Due to differences in how calculations are performed with floating-point numbers
(like -123.45, 7.5 or 0.0236937) across operating systems, you might see minor
variations in outputs—especially in digits far to the right of the decimal point.

• When we show outputs that appear in separate windows, we crop the windows to
remove their borders.
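The floating-point item above is easier to appreciate with a concrete value. The following sketch does not demonstrate a difference between operating systems. It only shows, assuming the usual IEEE 754 double-precision representation, why the digits far to the right of the decimal point are not reliable and how you might compare or display such values.

```python
import math

total = 0.1 + 0.2

print(total)                     # 0.30000000000000004
print(total == 0.3)              # False: the values differ far to the right
print(math.isclose(total, 0.3))  # True: equal within a small tolerance
print(f'{total:.2f}')            # 0.30 when formatted to two decimal places
```

Comparing with math.isclose, or formatting to a fixed number of digits, keeps such tiny variations from mattering when you check your results against the book's.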


GETTING YOUR QUESTIONS ANSWERED 


Online forums enable you to interact with other Python programmers and get your
Python questions answered. Popular Python and general programming forums include:

• python-forum.io

• StackOverflow.com

• https://www.dreamincode.net/forums/forum/29-python/

Also, many vendors provide forums for their tools and libraries. Most of the libraries
you'll use in this book are managed and maintained at github.com. Some library
maintainers provide support through the Issues tab on a given library’s GitHub page.

If you cannot find an answer to your questions online, please see our web page for the
book at

http://www.deitel.com*

* Our website is undergoing a major upgrade. If you do not find something you need,
please write to us directly at deitel@deitel.com.


You're now ready to begin reading Python for Programmers. We hope you enjoy the
book!


1. Introduction to Computers and Python 


Objectives 

In this chapter you'll:

• Learn about exciting recent developments in computing.

• Review object-oriented programming basics.

• Understand the strengths of Python.

• Be introduced to key Python and data-science libraries you'll use in this book.

• Test-drive the IPython interpreter’s interactive mode for executing Python code.

• Execute a Python script that animates a bar chart.

• Create and test-drive a web-browser-based Jupyter Notebook for executing Python code.

• Learn how big “big data” is and how quickly it’s getting even bigger.

• Read a big-data case study on a popular mobile navigation app.

• Be introduced to artificial intelligence—at the intersection of computer science and data
science.


Outline 


1.1 Introduction
1.2 A Quick Review of Object Technology Basics
1.3 Python
1.4 It’s the Libraries!
1.4.1 Python Standard Library
1.4.2 Data-Science Libraries
1.5 Test-Drives: Using IPython and Jupyter Notebooks
1.5.1 Using IPython Interactive Mode as a Calculator
1.5.2 Executing a Python Program Using the IPython Interpreter
1.5.3 Writing and Executing Code in a Jupyter Notebook
1.6 The Cloud and the Internet of Things
1.6.1 The Cloud
1.6.2 Internet of Things
1.7 How Big Is Big Data?
1.7.1 Big Data Analytics
1.7.2 Data Science and Big Data Are Making a Difference: Use Cases
1.8 Case Study—A Big-Data Mobile Application
1.9 Intro to Data Science: Artificial Intelligence—at the Intersection of CS and Data Science
1.10 Wrap-Up


1.1 INTRODUCTION 


Welcome to Python—one of the world’s most widely used computer programming languages
and, according to the Popularity of Programming Languages (PYPL) Index, the world’s
most popular.¹

¹ https://pypl.github.io/PYPL.html (as of January 2019).

Here, we introduce terminology and concepts that lay the groundwork for the Python
programming you'll learn in Chapters 2–10 and the big-data, artificial-intelligence and
cloud-based case studies we present in Chapters 11–16.


We'll review object-oriented programming terminology and concepts. You'll learn why 
Python has become so popular. We'll introduce the Python Standard Library and various 
data-science libraries that help you avoid “reinventing the wheel.” You'll use these libraries to 
create software objects that you'll interact with to perform significant tasks with modest 


numbers of instructions. 


Next, you'll work through three test-drives showing how to execute Python code:

• In the first, you'll use IPython to execute Python instructions interactively and
immediately see their results.

• In the second, you'll execute a substantial Python application that will display an
animated bar chart summarizing rolls of a six-sided die as they occur. You'll see the “Law
of Large Numbers” in action. In Chapter 6, you'll build this application with the
Matplotlib visualization library.

• In the last, we'll introduce Jupyter Notebooks using JupyterLab—an interactive,
web-browser-based tool in which you can conveniently write and execute Python instructions.
Jupyter Notebooks enable you to include text, images, audios, videos, animations and
code.
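To give a feel for the die-rolling test-drive before you reach it, here is a minimal text-only sketch of the idea behind the “Law of Large Numbers.” It is not the book's animated application (which uses Matplotlib). The seed value and the number of rolls are arbitrary choices of ours, made only so the run is repeatable.

```python
import random
from collections import Counter

random.seed(2019)  # arbitrary seed, used only to make the run repeatable

ROLLS = 60_000  # arbitrary; try smaller and larger values
counts = Counter(random.randrange(1, 7) for _ in range(ROLLS))

# With many rolls, each face's share should approach 1/6 (about 16.67%).
for face in range(1, 7):
    frequency = counts[face] / ROLLS
    print(f'{face}: {counts[face]:>6} rolls ({frequency:.2%})')
```

If you reduce ROLLS to 60, the percentages typically vary widely. As the number of rolls grows, they settle toward one-sixth each.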


In the past, most computer applications ran on standalone computers (that is, not networked
together). Today’s applications can be written with the aim of communicating among the
world’s billions of computers via the Internet. We'll introduce the Cloud and the Internet of
Things (IoT), laying the groundwork for the contemporary applications you'll develop in
Chapters 11–16.


You'll learn just how big “big data” is and how quickly it’s getting even bigger. Next, we'll 
present a big-data case study on the Waze mobile navigation app, which uses many current 
technologies to provide dynamic driving directions that get you to your destination as quickly 
and as safely as possible. As we walk through those technologies, we'll mention where you'll 
use many of them in this book. The chapter closes with our first Intro to Data Science section 
in which we discuss a key intersection between computer science and data science—artificial 


intelligence. 


1.2 A QUICK REVIEW OF OBJECT TECHNOLOGY BASICS 


As demands for new and more powerful software are soaring, building software quickly, 
correctly and economically is important. Objects, or more precisely, the classes objects come 
from, are essentially reusable software components. There are date objects, time objects, 
audio objects, video objects, automobile objects, people objects, etc. Almost any noun can be 
reasonably represented as a software object in terms of attributes (e.g., name, color and size) 
and behaviors (e.g., calculating, moving and communicating). Software-development groups 
can use a modular, object-oriented design-and-implementation approach to be much more 
productive than with earlier popular techniques like “structured programming.” Object- 


oriented programs are often easier to understand, correct and modify. 


Automobile as an Object 


To help you understand objects and their contents, let’s begin with a simple analogy. Suppose
you want to drive a car and make it go faster by pressing its accelerator pedal. What must
happen before you can do this? Well, before you can drive a car, someone has to design it. A
car typically begins as engineering drawings, similar to the blueprints that describe the
design of a house. These drawings include the design for an accelerator pedal. The pedal
hides from the driver the complex mechanisms that make the car go faster, just as the brake
pedal “hides” the mechanisms that slow the car, and the steering wheel “hides” the
mechanisms that turn the car. This enables people with little or no knowledge of how
engines, braking and steering mechanisms work to drive a car easily.


Just as you cannot cook meals in the blueprint of a kitchen, you cannot drive a car’s 
engineering drawings. Before you can drive a car, it must be built from the engineering 
drawings that describe it. A completed car has an actual accelerator pedal to make it go 
faster, but even that’s not enough—the car won’t accelerate on its own (hopefully!), so the 


driver must press the pedal to accelerate the car. 


Methods and Classes 


Let’s use our car example to introduce some key object-oriented programming concepts. 
Performing a task in a program requires a method. The method houses the program 
statements that perform its tasks. The method hides these statements from its user, just as 
the accelerator pedal of a car hides from the driver the mechanisms of making the car go 
faster. In Python, a program unit called a class houses the set of methods that perform the 
class’s tasks. For example, a class that represents a bank account might contain one method 
to deposit money to an account, another to withdraw money from an account and a third to 
inquire what the account’s balance is. A class is similar in concept to a car’s engineering 


drawings, which house the design of an accelerator pedal, steering wheel, and so on. 


Instantiation 


Just as someone has to build a car from its engineering drawings before you can drive a car, 
you must build an object of a class before a program can perform the tasks that the class’s 
methods define. The process of doing this is called instantiation. An object is then referred to 


as an instance of its class. 


Reuse 


Just as a car’s engineering drawings can be reused many times to build many cars, you can 
reuse a class many times to build many objects. Reuse of existing classes when building new 
classes and programs saves time and effort. Reuse also helps you build more reliable and 
effective systems because existing classes and components often have undergone extensive 
testing, debugging and performance tuning. Just as the notion of interchangeable parts was 
crucial to the Industrial Revolution, reusable classes are crucial to the software revolution 


that has been spurred by object technology. 


In Python, you'll typically use a building-block approach to create your programs. To avoid 
reinventing the wheel, you'll use existing high-quality pieces wherever possible. This software 


reuse is a key benefit of object-oriented programming. 


Messages and Method Calls 


When you drive a car, pressing its gas pedal sends a message to the car to perform a task—
that is, to go faster. Similarly, you send messages to an object. Each message is implemented
as a method call that tells a method of the object to perform its task. For example, a program
might call a bank-account object’s deposit method to increase the account’s balance.


Attributes and Instance Variables 


A car, besides having capabilities to accomplish tasks, also has attributes, such as its color, its 
number of doors, the amount of gas in its tank, its current speed and its record of total miles 
driven (i.e., its odometer reading). Like its capabilities, the car’s attributes are represented as 
part of its design in its engineering diagrams (which, for example, include an odometer and a 
fuel gauge). As you drive an actual car, these attributes are carried along with the car. Every 
car maintains its own attributes. For example, each car knows how much gas is in its own gas 


tank, but not how much is in the tanks of other cars. 


An object, similarly, has attributes that it carries along as it’s used in a program. These 
attributes are specified as part of the object’s class. For example, a bank-account object has a 
balance attribute that represents the amount of money in the account. Each bank-account 
object knows the balance in the account it represents, but not the balances of the other 
accounts in the bank. Attributes are specified by the class’s instance variables. A class’s 
(and its object’s) attributes and methods are intimately related, so classes wrap together their 


attributes and methods. 
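The bank-account example that runs through the preceding sections can be sketched as a small Python class. This is an illustrative sketch, not code from the book: the class name BankAccount, the withdraw method's checks and the sample amounts are our assumptions. It shows a class, its methods, instantiation, method calls and per-object instance variables in one place.

```python
class BankAccount:
    """A class: the 'engineering drawings' for bank-account objects."""

    def __init__(self, name, balance=0):
        # Instance variables hold each object's own attributes.
        self.name = name
        self.balance = balance

    def deposit(self, amount):
        """Add a positive amount to this account's balance."""
        if amount <= 0:
            raise ValueError('deposit amount must be positive')
        self.balance += amount

    def withdraw(self, amount):
        """Subtract an amount, provided the account holds enough money."""
        if amount <= 0:
            raise ValueError('withdrawal amount must be positive')
        if amount > self.balance:
            raise ValueError('insufficient funds')
        self.balance -= amount


# Instantiation: build two objects (instances) from the one class.
account1 = BankAccount('Account 1', 100)
account2 = BankAccount('Account 2')

account1.deposit(50)   # a method call: a message sent to account1 only
account1.withdraw(30)

print(account1.balance)  # 120
print(account2.balance)  # 0: each object knows only its own balance
```

Inquiring about the balance is done here by reading the balance attribute directly, which is common in Python. A dedicated method would work as well.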


Inheritance 


A new class of objects can be created conveniently by inheritance—the new class (called the 
subclass) starts with the characteristics of an existing class (called the superclass), 
possibly customizing them and adding unique characteristics of its own. In our car analogy, 
an object of class “convertible” certainly is an object of the more general class “automobile,” 


but more specifically, the roof can be raised or lowered. 
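The convertible analogy translates directly into Python's inheritance syntax. The sketch below is ours rather than the book's: the class names Automobile and Convertible follow the analogy, while the attributes and method names are hypothetical, chosen only for illustration.

```python
class Automobile:
    """Superclass: characteristics shared by all automobiles."""

    def __init__(self, color):
        self.color = color
        self.speed = 0

    def accelerate(self, amount):
        """Increase the current speed, like pressing the accelerator pedal."""
        self.speed += amount


class Convertible(Automobile):
    """Subclass: an automobile whose roof can be raised or lowered."""

    def __init__(self, color):
        super().__init__(color)  # start with the superclass's attributes
        self.roof_raised = True  # a characteristic unique to convertibles

    def lower_roof(self):
        self.roof_raised = False

    def raise_roof(self):
        self.roof_raised = True


car = Convertible('red')
car.accelerate(30)  # inherited from Automobile
car.lower_roof()    # defined only in Convertible

print(isinstance(car, Automobile))  # True: a convertible is an automobile
print(car.speed, car.roof_raised)   # 30 False
```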


Object-Oriented Analysis and Design (OOAD) 


Soon you'll be writing programs in Python. How will you create the code for your programs? 
Perhaps, like many programmers, you'll simply turn on your computer and start typing. This 
approach may work for small programs (like the ones we present in the early chapters of the 
book), but what if you were asked to create a software system to control thousands of 
automated teller machines for a major bank? Or suppose you were asked to work on a team 
of 1,000 software developers building the next generation of the U.S. air traffic control 
system? For projects so large and complex, you should not simply sit down and start writing 


programs. 


To create the best solutions, you should follow a detailed analysis process for determining
your project’s requirements (i.e., defining what the system is supposed to do), then
develop a design that satisfies them (i.e., specifying how the system should do it). Ideally,
you'd go through this process and carefully review the design (and have your design reviewed
by other software professionals) before writing any code. If this process involves analyzing
and designing your system from an object-oriented point of view, it’s called an
object-oriented analysis-and-design (OOAD) process. Languages like Python are
object-oriented. Programming in such a language, called object-oriented programming (OOP),
allows you to implement an object-oriented design as a working system.


1.3 PYTHON 


Python is an object-oriented scripting language that was released publicly in 1991. It was 
developed by Guido van Rossum of the National Research Institute for Mathematics and 


Computer Science in Amsterdam. 


Python has rapidly become one of the world’s most popular programming languages. It’s now
particularly popular for educational and scientific computing,² and it recently surpassed the
programming language R as the most popular data-science programming language.³,⁴,⁵

Here are some reasons why Python is popular and everyone should consider learning it:⁶,⁷,⁸





² https://www.oreilly.com/ideas/5-things-to-watch-in-python-in-2017.

³ https://www.kdnuggets.com/2017/08/python-overtakes-r-leader-analytics-data-science.html.

⁴ https://www.r-bloggers.com/data-science-job-report-2017-r-passes-sas-but-python-leaves-them-both-behind/.

⁵ https://www.oreilly.com/ideas/5-things-to-watch-in-python-in-2017.

⁶ https://dbader.org/blog/why-learn-python.

⁷ https://simpleprogrammer.com/2017/01/18/7-reasons-why-you-should-learn-python/.

⁸ https://www.oreilly.com/ideas/5-things-to-watch-in-python-in-2017.


• It’s open source, free and widely available with a massive open-source community.

• It’s easier to learn than languages like C, C++, C# and Java, enabling novices and
professional developers to get up to speed quickly.

• It’s easier to read than many other popular programming languages.

• It’s widely used in education.⁹

⁹ Tollervey, N., Python in Education: Teach, Learn, Program (O'Reilly Media, Inc.,
2015).


• It enhances developer productivity with extensive standard libraries and third-party
open-source libraries, so programmers can write code faster and perform complex tasks
with minimal code. We'll say more about this in Section 1.4.

• There are massive numbers of free open-source Python applications.

• It’s popular in web development (e.g., Django, Flask).

• It supports popular programming paradigms—procedural, functional-style and
object-oriented.¹⁰ We'll begin introducing functional-style programming features in Chapter 4
and use them in subsequent chapters.

¹⁰ https://en.wikipedia.org/wiki/Python_(programming_language).








• It simplifies concurrent programming—with asyncio and async/await, you're able to write
single-threaded concurrent code, greatly simplifying the inherently complex processes of
writing, debugging and maintaining that code.¹¹,¹²

¹¹ https://docs.python.org/3/library/asyncio.html.

¹² https://www.oreilly.com/ideas/5-things-to-watch-in-python-in-2017.





• There are lots of capabilities for enhancing Python performance.

• It’s used to build anything from simple scripts to complex apps with massive numbers of
users, such as Dropbox, YouTube, Reddit, Instagram and Quora.¹³

¹³ https://www.hartmannsoftware.com/Blog/Articles_from_Software_Fans/Most-Famous-Software-Programs-Written-in-Python.





• It’s popular in artificial intelligence, which is enjoying explosive growth, in part because of
its special relationship with data science.

• It’s widely used in the financial community.¹⁴

¹⁴ Kolanovic, M. and R. Krishnamachari, Big Data and AI Strategies: Machine Learning
and Alternative Data Approach to Investing (J.P. Morgan, 2017).


• There’s an extensive job market for Python programmers across many disciplines,
especially in data-science-oriented positions, and Python jobs are among the highest
paid of all programming jobs.¹⁵,¹⁶

¹⁵ https://www.infoworld.com/article/3170838/developer/get-paid-10-programming-languages-to-learn-in-2017.html.

¹⁶ https://medium.com/@ChallengeRocket/top-10-of-programming-languages-with-the-highest-salaries-in-2017-4390f468256e.








• R is a popular open-source programming language for statistical applications and
visualization. Python and R are the two most widely used data-science languages.
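To illustrate the concurrency item in the list above, here is a minimal sketch of single-threaded concurrent code with asyncio and async/await. The coroutine name fetch and the delays are hypothetical, and asyncio.sleep merely stands in for a real I/O operation such as a network request. The sketch assumes Python 3.7 or later, where asyncio.run is available, and that it runs as a script rather than inside an already-running event loop.

```python
import asyncio


async def fetch(name, delay):
    """Simulate a slow operation; asyncio.sleep stands in for real I/O."""
    await asyncio.sleep(delay)  # yields control so other tasks can run
    return f'{name} done'


async def main():
    # Run both coroutines concurrently in a single thread.
    # gather returns results in the order the awaitables were passed.
    return await asyncio.gather(fetch('first', 0.2), fetch('second', 0.1))


results = asyncio.run(main())
print(results)  # ['first done', 'second done']
```

Because the two waits overlap, the whole run takes roughly as long as the longer delay rather than the sum of both.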


Anaconda Python Distribution 


We use the Anaconda Python distribution because it’s easy to install on Windows, macOS
and Linux and supports the latest versions of Python, the IPython interpreter (introduced in
Section 1.5.1) and Jupyter Notebooks (introduced in Section 1.5.3). Anaconda also includes
other software packages and libraries commonly used in Python programming and data
science, allowing you to focus on Python and data science, rather than software installation
issues. The IPython interpreter¹⁷ has features that help you explore, discover and experiment
with Python, the Python Standard Library and the extensive set of third-party libraries.

¹⁷ https://ipython.org/.


Zen of Python 


We adhere to Tim Peters’ The Zen of Python, which summarizes Python creator Guido van
Rossum’s design principles for the language. This list can be viewed in IPython with the
command import this. The Zen of Python is defined in Python Enhancement Proposal
(PEP) 20. “A PEP is a design document providing information to the Python community, or
describing a new feature for Python or its processes or environment.”¹⁸

¹⁸ https://www.python.org/dev/peps/pep-0001/.


1.4 IT’S THE LIBRARIES!


Throughout the book, we focus on using existing libraries to help you avoid “reinventing the 
wheel,” thus leveraging your program-development efforts. Often, rather than developing lots 
of original code—a costly and time-consuming process—you can simply create an object of a 
pre-existing library class, which takes only a single Python statement. So, libraries will help 
you perform significant tasks with modest amounts of code. In this book, you'll use a broad 


range of Python standard libraries, data-science libraries and third-party libraries. 
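As a small illustration of creating an object of a pre-existing library class in a single statement, consider the following sketch. It uses three Python Standard Library modules. The sample sentence, price and grades are made-up values of ours.

```python
from collections import Counter
from decimal import Decimal
from statistics import mean

# One statement creates a Counter object that tallies every word.
word_counts = Counter('the quick brown fox jumps over the lazy dog the end'.split())
print(word_counts.most_common(1))  # [('the', 3)]

# One statement creates a Decimal object for precise monetary arithmetic.
price = Decimal('19.99')
print(price * 3)  # 59.97

# A single library function call replaces a hand-written averaging loop.
print(mean([85, 93, 45, 89, 85]))  # 79.4
```

In each case, the library class or function does the work that would otherwise require a loop, a dictionary of tallies or careful rounding logic.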


1.4.1 Python Standard Library 


The Python Standard Library provides rich capabilities for text/binary data processing, 
mathematics, functional-style programming, file/directory access, data persistence, data 
compression/archiving, cryptography, operating-system services, concurrent programming, 
interprocess communication, networking protocols, JSON/XML/other Internet data formats, 
multimedia, internationalization, GUI, debugging, profiling and more. The following table 


lists some of the Python Standard Library modules that we use in examples. 


Some of the Python Standard Library modules we use in the book

collections—Additional data structures beyond lists, tuples, dictionaries and sets.

csv—Processing comma-separated value files.

datetime, time—Date and time manipulations.

decimal—Fixed-point and floating-point arithmetic, including monetary calculations.

doctest—Simple unit testing via validation tests and expected results embedded in docstrings.

json—JavaScript Object Notation (JSON) processing for use with web services and
NoSQL document databases.

math—Common math constants and operations.

os—Interacting with the operating system.

queue—First-in, first-out data structure.

random—Pseudorandom numbers.

re—Regular expressions for pattern matching.

sqlite3—SQLite relational database access.

statistics—Mathematical statistics functions like mean, median, mode and variance.

string—String processing.

sys—Command-line argument processing; standard input, standard output and standard
error streams.

timeit—Performance analysis.


1.4.2 Data-Science Libraries


Python has an enormous and rapidly growing community of open-source developers in many 


fields. One of the biggest reasons for Python’s popularity is the extraordinary range of open- 


source libraries developed by its open-source community. One of our goals is to create 


examples and implementation case studies that give you an engaging, challenging and 


entertaining introduction to Python programming, while also involving you in hands-on data 


science, key data-science libraries and more. You'll be amazed at the substantial tasks you 


can accomplish in just a few lines of code. The following table lists various popular data- 


science libraries. You'll use many of these as you work through our data-science examples. 


For visualization, we'll use Matplotlib, Seaborn and Folium, but there are many more. For a 


nice summary of Python visualization libraries, see http://pyviz.org/.


Popular Python libraries used in data science


Scientific Computing and Statistics 


NumPy (Numerical Python)—Python does not have a built-in array data structure. 
It uses lists, which are convenient but relatively slow. NumPy provides the high- 
performance ndarray data structure to represent lists and matrices, and it also 


provides routines for processing such data structures. 


SciPy (Scientific Python)—Built on NumPy, SciPy adds routines for scientific 
processing, such as integrals, differential equations, additional matrix processing 


and more. scipy.org controls SciPy and NumPy. 


StatsModels—Provides support for estimations of statistical models, statistical 


tests and statistical data exploration. 


Data Manipulation and Analysis 


Pandas—An extremely popular library for data manipulations. Pandas makes 
abundant use of NumPy’s ndarray. Its two key data structures are Series (one 


dimensional) and DataFrames (two dimensional). 


Visualization 


Matplotlib—A highly customizable visualization and plotting library. Supported 


plots include regular, scatter, bar, contour, pie, quiver, grid, polar axis, 3D and text. 


Seaborn—A higher-level visualization library built on Matplotlib. Seaborn adds a 
nicer look-and-feel, additional visualizations and enables you to create visualizations 


with less code. 


Machine Learning, Deep Learning and Reinforcement Learning 


scikit-learn—Top machine-learning library. Machine learning is a subset of AI.
Deep learning is a subset of machine learning that focuses on neural networks.


Keras—A deep-learning library that runs on top of TensorFlow (Google), CNTK
(Microsoft’s cognitive toolkit for deep learning) or Theano (Université de Montréal).


TensorFlow—From Google, this is the most widely used deep learning library. 
TensorFlow works with GPUs (graphics processing units) or Google’s custom TPUs 
(Tensor processing units) for performance. TensorFlow is important in AI and big 
data analytics—where processing demands are huge. You'll use the version of Keras 


that’s built into TensorFlow. 


OpenAI Gym—A library and environment for developing, testing and comparing 


reinforcement-learning algorithms. 


Natural Language Processing (NLP) 


NLTK (Natural Language Toolkit)—Used for natural language processing (NLP) 
tasks. 


TextBlob—An object-oriented NLP text-processing library built on the NLTK and 
pattern NLP libraries. TextBlob simplifies many NLP tasks. 


Gensim—Similar to NLTK. Commonly used to build an index for a collection of 
documents, then determine how similar another document is to each of those in the 


index. 


1.5 TEST-DRIVES: USING IPYTHON AND JUPYTER 
NOTEBOOKS 


Before reading this section, follow the instructions in the Before You Begin section to install
the Anaconda Python distribution, which contains the IPython interpreter. In this section,
you'll test-drive the IPython interpreter in two modes:


• In interactive mode, you'll enter small bits of Python code called snippets and
immediately see their results.


• In script mode, you'll execute code loaded from a file that has the .py extension (short
for Python). Such files are called scripts or programs, and they’re generally longer than
the code snippets you'll use in interactive mode.


Then, you'll learn how to use the browser-based environment known as the Jupyter Notebook 


for writing and executing Python code.


Jupyter supports many programming languages by installing their “kernels.” For more
information, see https://github.com/jupyter/jupyter/wiki/Jupyter-kernels.


1.5.1 Using IPython Interactive Mode as a Calculator 


Let’s use IPython interactive mode to evaluate simple arithmetic expressions.


Entering IPython in Interactive Mode 


First, open a command-line window on your system: 


• On macOS, open a Terminal from the Applications folder’s Utilities subfolder.

• On Windows, open the Anaconda Command Prompt from the start menu.

• On Linux, open your system’s Terminal or shell (this varies by Linux distribution).


In the command-line window, type ipython, then press Enter (or Return). You'll see text
like the following (this varies by platform and by IPython version):


Python 3.7.0 | packaged by conda-forge | (default, Jan 20 2019, ...)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:


The text "In [1]:" is a prompt, indicating that IPython is waiting for your input. You can
type ? for help or begin entering snippets, as you’ll do momentarily.


Evaluating Expressions 


In interactive mode, you can evaluate expressions: 


In [1]: 45 + 72
Out[1]: 117

In [2]:


After you type 45 + 72 and press Enter, IPython reads the snippet, evaluates it and prints
its result in Out[1]. Then IPython displays the In [2] prompt to show that it’s waiting for
you to enter your second snippet. For each new snippet, IPython adds 1 to the number in the
square brackets. Each In [1] prompt in the book indicates that we’ve started a new
interactive session. We generally do that for each new section of a chapter. (In the next
chapter, you’ll see that there are some cases in which Out[] is not displayed.)


Let’s evaluate a more complex expression:


In [2]: 5 * (12.7 - 4) / 2
Out[2]: 21.75


Python uses the asterisk (*) for multiplication and the forward slash (/) for division. As in
mathematics, parentheses force the evaluation order, so the parenthesized expression (12.7
- 4) evaluates first, giving 8.7. Next, 5 * 8.7 evaluates, giving 43.5. Then, 43.5 / 2
evaluates, giving the result 21.75, which IPython displays in Out[2]. Whole numbers, like
5, 4 and 2, are called integers. Numbers with decimal points, like 12.7, 43.5 and 21.75,
are called floating-point numbers.
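
The same two expressions can also be checked in an ordinary Python script. The following is a minimal sketch for illustration; the variable names total and result are our own and do not come from the book's examples.

```python
# Evaluate the two expressions from this section and inspect their types.
total = 45 + 72              # integer addition
result = 5 * (12.7 - 4) / 2  # parentheses first, then * and / from left to right

print(total, type(total).__name__)    # the value and the type name of an integer
print(result, type(result).__name__)  # the value and the type name of a float

# The / operator produces a floating-point result even for two integers.
print(type(4 / 2).__name__)
```

Printing the type names makes the distinction between integers and floating-point numbers visible without relying on how the values happen to be displayed.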


Exiting Interactive Mode 


To leave interactive mode, you can: 





• Type the exit command at the current In [] prompt and press Enter to exit
immediately.


• Type the key sequence <Ctrl> + d (or <control> + d). This displays the prompt "Do you
really want to exit ([y]/n)?". The square brackets around y indicate that it’s
the default response—pressing Enter submits the default response and exits.


• Type <Ctrl> + d (or <control> + d) twice (macOS and Linux only).


1.5.2 Executing a Python Program Using the IPython Interpreter 


In this section, you'll execute a script named RollDieDynamic.py that you'll write in
Chapter 6. The .py extension indicates that the file contains Python source code. The script
RollDieDynamic.py simulates rolling a six-sided die. It presents a colorful animated
visualization that dynamically graphs the frequencies of each die face.


Changing to This Chapter’s Examples Folder 


You'll find the script in the book’s ch01 source-code folder. In the Before You Begin section 


you extracted the examples folder to your user account’s Documents folder. Each chapter 


has a folder containing that chapter’s source code. The folder is named ch##, where ## is a 
two-digit chapter number from 01 to 17. First, open your system’s command-line window. 


Next, use the cd (“change directory”) command to change to the ch01 folder: 


• On macOS/Linux, type cd ~/Documents/examples/ch01, then press Enter.


• On Windows, type cd C:\Users\YourAccount\Documents\examples\ch01, then
press Enter.


Executing the Script 


To execute the script, type the following command at the command line, then press Enter: 
ipython RollDieDynamic.py 6000 1 


The script displays a window, showing the visualization. The numbers 6000 and 1 tell this
script the number of times to roll dice and how many dice to roll each time. In this case, we'll
update the chart 6000 times for 1 die at a time.


For a six-sided die, the values 1 through 6 should each occur with “equal likelihood”—the
probability of each is 1/6th or about 16.667%. If we roll a die 6000 times, we’d expect about
1000 of each face. Like coin tossing, die rolling is random, so there could be some faces with
fewer than 1000, some with 1000 and some with more than 1000. We took the screen
captures below during the script’s execution. This script uses randomly generated die values,
so your results will differ. Experiment with the script by changing the value 1 to 100, 1000
and 10000. Notice that as the number of die rolls gets larger, the frequencies zero in on
16.667%. This is a phenomenon of the “Law of Large Numbers.”


Roll the dice 6000 times and roll 1 die each time:
ipython RollDieDynamic.py 6000 1
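
The tendency of the frequencies to settle near 16.667% can also be seen without any graphics. The text-only sketch below is not the RollDieDynamic.py script; the function name roll_frequencies and its seed parameter are hypothetical, chosen only for this illustration.

```python
import random
from collections import Counter


def roll_frequencies(number_of_rolls, seed=None):
    """Return a Counter mapping each die face (1-6) to how often it occurred."""
    rng = random.Random(seed)  # private generator; pass a seed for repeatable runs
    return Counter(rng.randint(1, 6) for _ in range(number_of_rolls))


# Show each face's share of the rolls for progressively larger experiments.
for rolls in (600, 6_000, 60_000):
    frequencies = roll_frequencies(rolls)
    percentages = ' '.join(f'{frequencies[face] / rolls:6.2%}' for face in range(1, 7))
    print(f'{rolls:>6} rolls: {percentages}')
```

Because the rolls are random, each run prints different percentages, but the rows for larger roll counts should generally sit closer to 16.67% than the rows for smaller ones.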


Creating Scripts


Typically, you create your Python source code in an editor that enables you to type text. Using 
the editor, you type a program, make any necessary corrections and save it to your computer. 


Integrated development environments (IDEs) provide tools that support the entire 


software-development process, such as editors, debuggers for locating logic errors that cause 
programs to execute incorrectly and more. Some popular Python IDEs include Spyder (which 


comes with Anaconda), PyCharm and Visual Studio Code. 


Problems That May Occur at Execution Time 


Programs often do not work on the first try. For example, an executing program might try to 
divide by zero (an illegal operation in Python). This would cause the program to display an 
error message. If this occurred in a script, you'd return to the editor, make the necessary 
corrections and re-execute the script to determine whether the corrections fixed the 


problem(s). 


Errors such as division by zero occur as a program runs, so they’re called runtime errors or 
execution-time errors. Fatal runtime errors cause programs to terminate immediately 
without having successfully performed their jobs. Non-fatal runtime errors allow 


programs to run to completion, often producing incorrect results. 
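
As a small illustration of a runtime error, the sketch below divides by zero when it is given an empty list. The average function is hypothetical, and the try statement is included only so that the sketch keeps running after the error occurs.

```python
def average(values):
    """Return the arithmetic mean of values; fails for an empty sequence."""
    return sum(values) / len(values)


print(average([85, 90, 95]))  # works because the list is not empty

try:
    print(average([]))  # len([]) is 0, so the division fails at execution time
except ZeroDivisionError as error:
    # Without this handler, the script would terminate and display a traceback.
    print(f'Runtime error caught: {error}')
```

If the try statement were removed, the second call would be a fatal runtime error: the script would stop at that line and nothing after it would execute.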


1.5.3 Writing and Executing Code in a Jupyter Notebook 


The Anaconda Python Distribution that you installed in the Before You Begin section comes 
with the Jupyter Notebook—an interactive, browser-based environment in which you can 
write and execute code and intermix the code with text, images and video. Jupyter Notebooks 
are broadly used in the data-science community in particular and the broader scientific 
community in general. They’re the preferred means of doing Python-based data analytics 
studies and reproducibly communicating their results. The Jupyter Notebook environment 


supports a growing number of programming languages. 


For your convenience, all of the book’s source code also is provided in Jupyter Notebooks 
that you can simply load and execute. In this section, you'll use the JupyterLab interface, 
which enables you to manage your notebook files and other files that your notebooks use (like 
images and videos). As you'll see, JupyterLab also makes it convenient to write code, execute 


it, see the results, modify the code and execute it again. 


You'll see that coding in a Jupyter Notebook is similar to working with IPython—in fact, 
Jupyter Notebooks use IPython by default. In this section, you'll create a notebook, add the 


code from Section 1.5.1 to it and execute that code.


Opening JupyterLab in Your Browser 


To open JupyterLab, change to the ch01 examples folder in your Terminal, shell or 
Anaconda Command Prompt (as in Section 1.5.2), type the following command, then press
Enter (or Return): 


jupyter lab 


This executes the Jupyter Notebook server on your computer and opens JupyterLab in your 


default web browser, showing the ch01 folder’s contents in the File Browser tab 


at the left side of the JupyterLab interface: 


[Screenshot: the JupyterLab window, showing the menu bar (File, Edit, View, Run, Kernel, Tabs, Settings, Help) and the Python 3 launcher buttons.]


The Jupyter Notebook server enables you to load and run Jupyter Notebooks in your web
browser. From the JupyterLab Files tab, you can double-click files to open them in the right 
side of the window where the Launcher tab is currently displayed. Each file you open 
appears as a separate tab in this part of the window. If you accidentally close your browser, 


you can reopen JupyterLab by entering the following address in your web browser:


http://localhost:8888/lab 


Creating a New Jupyter Notebook 


In the Launcher tab under Notebook, click the Python 3 button to create a new Jupyter
Notebook named Untitled.ipynb in which you can enter and execute Python 3 code. The
file extension .ipynb is short for IPython Notebook—the original name of the Jupyter
Notebook.


Renaming the Notebook 


Rename Untitled.ipynb as TestDrive.ipynb:


1. Right-click the Untitled.ipynb tab and select Rename Notebook.


2. Change the name to TestDrive.ipynb and click RENAME.


The top of JupyterLab should now appear as follows: 


[Screenshot: JupyterLab after renaming. The TestDrive.ipynb tab is open beside the Launcher tab, and the File Browser lists TestDrive.ipynb and RollDieDynamic.py.]


Evaluating an Expression 


The unit of work in a notebook is a cell in which you can enter code snippets. By default, a
new notebook contains one cell—the rectangle in the TestDrive.ipynb notebook—but you
can add more. To the cell’s left, the notation [ ]: is where the Jupyter Notebook will display
the cell’s snippet number after you execute the cell. Click in the cell, then type the expression


45 + 72


To execute the current cell’s code, type Ctrl + Enter (or control + Enter). JupyterLab executes 
the code in IPython, then displays the results below the cell: 


[Screenshot: the executed cell, [1]: 45 + 72, with its output [1]: 117 displayed below it.]


Adding and Executing Another Cell 


Let’s evaluate a more complex expression. First, click the + button in the toolbar above the 


notebook’s first cell—this adds a new cell below the current one: 


[Screenshot: a new empty cell added below the first cell, which shows [1]: 45 + 72 and its output [1]: 117.]


Click in the new cell, then type the expression


5 * (12.7 - 4) / 2


and execute the cell by typing Ctrl + Enter (or control + Enter): 


[Screenshot: the second cell, [2]: 5 * (12.7 - 4) / 2, with its output [2]: 21.75 displayed below it.]


Saving the Notebook 


If your notebook has unsaved changes, the X in the notebook’s tab will change to a filled circle (●). To save
the notebook, select the File menu in JupyterLab (not at the top of your browser’s window), 
then select Save Notebook. 


Notebooks Provided with Each Chapter’s Examples 


For your convenience, each chapter’s examples also are provided as ready-to-execute 
notebooks without their outputs. This enables you to work through them snippet-by-snippet 


and see the outputs appear as you execute each snippet. 


So that we can show you how to load an existing notebook and execute its cells, let’s reset the 
TestDrive.ipynb notebook to remove its output and snippet numbers. This will return it
to a state like the notebooks we provide for the subsequent chapters’ examples. From the 
Kernel menu select Restart Kernel and Clear All Outputs..., then click the RESTART 
button. The preceding command also is helpful whenever you wish to re-execute a notebook’s 


snippets. The notebook should now appear as follows: 


[Screenshot: the TestDrive.ipynb notebook after clearing its outputs. Both cells show empty [ ]: labels and no results.]


From the File menu, select Save Notebook, then click the TestDrive.ipynb tab’s X
button to close the notebook.
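
For readers curious about what "removing output and snippet numbers" means for the file itself, a notebook is stored as JSON text. The sketch below assumes the common version 4 notebook layout (a "cells" list whose code cells carry "outputs" and "execution_count" entries). The clear_outputs function and the sample data are hypothetical stand-ins, not JupyterLab's own implementation and not the real contents of TestDrive.ipynb.

```python
import json


def clear_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs and counters removed."""
    cleared = json.loads(json.dumps(notebook))  # simple deep copy via a JSON round trip
    for cell in cleared.get('cells', []):
        if cell.get('cell_type') == 'code':
            cell['outputs'] = []            # drop any displayed results
            cell['execution_count'] = None  # drop the snippet number
    return cleared


# A hand-built, minimal stand-in for a notebook with one executed cell.
sample = {
    'cells': [
        {'cell_type': 'code', 'source': ['45 + 72'], 'execution_count': 1,
         'outputs': [{'output_type': 'execute_result', 'execution_count': 1,
                      'data': {'text/plain': ['117']}, 'metadata': {}}],
         'metadata': {}},
    ],
    'metadata': {}, 'nbformat': 4, 'nbformat_minor': 2,
}

print(json.dumps(clear_outputs(sample), indent=1))
```

The function returns a modified copy rather than changing its argument, so the original data remains available for comparison.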


Opening and Executing an Existing Notebook 


When you launch JupyterLab from a given chapter’s examples folder, you'll be able to open 
notebooks from that folder or any of its subfolders. Once you locate a specific notebook, 
double-click it to open it. Open the TestDrive.ipynb notebook again now. Once a
notebook is open, you can execute each cell individually, as you did earlier in this section, or 


you can execute the entire notebook at once. To do so, from the Run menu select Run All 


Cells. The notebook will execute the cells in order, displaying each cell’s output below that 


cell. 


Closing JupyterLab 


When you're done with JupyterLab, you can close its browser tab, then in the Terminal, shell 
or Anaconda Command Prompt from which you launched JupyterLab, type Ctrl + c (or 


control + c) twice. 


JupyterLab Tips 
While working in JupyterLab, you might find these tips helpful: 


• If you need to enter and execute many snippets, you can execute the current cell and add
a new one below it by typing Shift + Enter, rather than Ctrl + Enter (or control + Enter).


• As you get into the later chapters, some of the snippets you'll enter in Jupyter Notebooks
will contain many lines of code. To display line numbers within each cell, select Show
line numbers from JupyterLab’s View menu.


More Information on Working with JupyterLab 


JupyterLab has many more features that you'll find helpful. We recommend that you read the 
Jupyter team’s introduction to JupyterLab at: 


https://jupyterlab.readthedocs.io/en/stable/index.html





For a quick overview, click Overview under GETTING STARTED. Also, under USER 
GUIDE read the introductions to The JupyterLab Interface, Working with Files, Text 
Editor and Notebooks for many additional features. 


1.6 THE CLOUD AND THE INTERNET OF THINGS 


1.6.1 The Cloud 


More and more computing today is done “in the cloud”—that is, distributed across the 
Internet worldwide. Many apps you use daily are dependent on cloud-based services that 
use massive clusters of computing resources (computers, processors, memory, disk drives, 
etc.) and databases that communicate over the Internet with each other and the apps you use. 
A service that provides access to itself over the Internet is known as a web service. As you'll 
see, using cloud-based services in Python often is as simple as creating a software object and 
interacting with it. That object then uses web services that connect to the cloud on your 
behalf. 


Throughout the Chapters 11–16 examples, you'll work with many cloud-based services:


• In Chapters 12 and 16, you'll use Twitter’s web services (via the Python library Tweepy) to
get information about specific Twitter users, search for tweets from the last seven days
and receive streams of tweets as they occur—that is, in real time.


• In Chapters 11 and 12, you'll use the Python library TextBlob to translate text between
languages. Behind the scenes, TextBlob uses the Google Translate web service to perform
those translations.


• In Chapter 13, you'll use IBM Watson’s Text to Speech, Speech to Text and Translate
services. You'll implement a traveler’s assistant translation app that enables you to speak
a question in English, transcribes the speech to text, translates the text to Spanish and
speaks the Spanish text. The app then allows you to speak a Spanish response (in case you
don’t speak Spanish, we provide an audio file you can use), transcribes the speech to text,
translates the text to English and speaks the English response. Via IBM Watson demos,
you'll also experiment with many other Watson cloud-based services in Chapter 13.


• In Chapter 16, you'll work with Microsoft Azure’s HDInsight service and other Azure web
services as you implement big-data applications using Apache Hadoop and Spark. Azure
is Microsoft’s set of cloud-based services.


• In Chapter 16, you'll use the Dweet.io web service to simulate an Internet-connected
thermostat that publishes temperature readings online. You'll also use a web-based
service to create a “dashboard” that visualizes the temperature readings over time and
warns you if the temperature gets too low or too high.


• In Chapter 16, you'll use a web-based dashboard to visualize a simulated stream of live
sensor data from the PubNub web service. You'll also create a Python app that visualizes a
PubNub simulated stream of live stock-price changes.


In most cases, you'll create Python objects that interact with web services on your behalf, 


hiding the details of how to access these services over the Internet. 


Mashups 


The applications-development methodology of mashups enables you to rapidly develop
powerful software applications by combining (often free) complementary web services and
other forms of information feeds—as you'll do in our IBM Watson traveler’s assistant
translation app. One of the first mashups combined the real-estate listings provided by
http://www.craigslist.org with the mapping capabilities of Google Maps to offer
maps that showed the locations of homes for sale or rent in a given area.


ProgrammableWeb (http://www.programmableweb.com/) provides a directory of over
20,750 web services and almost 8,000 mashups. They also provide how-to guides and sample
code for working with web services and creating your own mashups. According to their
website, some of the most widely used web services are Facebook, Google Maps, Twitter and
YouTube.


1.6.2 Internet of Things 


The Internet is no longer just a network of computers—it’s an Internet of Things (IoT). A 
thing is any object with an IP address and the ability to send, and in some cases receive, data 


automatically over the Internet. Such things include: 


• a car with a transponder for paying tolls,

• monitors for parking-space availability in a garage,

• a heart monitor implanted in a human,

• water quality monitors,

• a smart meter that reports energy usage,

• radiation detectors,

• item trackers in a warehouse,

• mobile apps that can track your movement and location,

• smart thermostats that adjust room temperatures based on weather forecasts and activity
in the home, and

• intelligent home appliances.


According to statista.com, there are already over 23 billion IoT devices in use today, and
there could be over 75 billion IoT devices in 2025.

https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/.


1.7 HOW BIG IS BIG DATA? 


For computer scientists and data scientists, data is now as important as writing programs.
According to IBM, approximately 2.5 quintillion bytes (2.5 exabytes) of data are created
daily, and 90% of the world’s data was created in the last two years. According to IDC, the
global data supply will reach 175 zettabytes (equal to 175 trillion gigabytes or 175 billion
terabytes) annually by 2025. Consider the following examples of various popular data
measures.


https://www.ibm.com/blogs/watson/2016/06/welcome-to-the-world-of-a-i/.

https://public.dhe.ibm.com/common/ssi/ecm/wr/en/wrl12345usen/watson-customer-engagement--watson-marketing-wr-other-papers-and-reports-wrl12345usen-20170719.pdf.

https://www.networkworld.com/article/3325397/storage/idc-expect-175-zettabytes-of-data-worldwide-by-2025.html.
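
The unit conversions in the paragraph above can be checked with a few lines of code. This is a minimal sketch that assumes decimal (power-of-ten) units, which is the convention behind the "175 trillion gigabytes" figure; the UNITS table and the convert function are our own illustrative names.

```python
# Decimal (power-of-ten) storage units, expressed in bytes.
UNITS = {
    'MB': 10**6, 'GB': 10**9, 'TB': 10**12,
    'PB': 10**15, 'EB': 10**18, 'ZB': 10**21,
}


def convert(amount, from_unit, to_unit):
    """Convert amount between two of the units listed in UNITS."""
    return amount * UNITS[from_unit] / UNITS[to_unit]


print(f"{convert(175, 'ZB', 'GB'):,.0f} GB")  # zettabytes expressed in gigabytes
print(f"{convert(175, 'ZB', 'TB'):,.0f} TB")  # zettabytes expressed in terabytes
print(f"{convert(2.5, 'EB', 'PB'):,.0f} PB")  # exabytes expressed in petabytes
```

Keeping every unit as a number of bytes means any pair of units can be converted with one multiplication and one division.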


Megabytes (MB)


One megabyte is about one million (actually 2²⁰) bytes. Many of the files we use on a daily
basis require one or more MBs of storage. Some examples include:


• MP3 audio files—High-quality MP3s range from 1 to 2.4 MB per minute.


• Photos—JPEG format photos taken on a digital camera can require about 8 to 10 MB per
photo.


• Video—Smartphone cameras can record video at various resolutions. Each minute of
video can require many megabytes of storage. For example, on one of our iPhones, the
Camera settings app reports that 1080p video at 30 frames-per-second (FPS) requires
130 MB/minute and 4K video at 30 FPS requires 350 MB/minute.


https://www.audiomountain.com/tech/audio-file-size.html.


Gigabytes (GB) 


One gigabyte is about 1000 megabytes (actually 2³⁰ bytes). A dual-layer DVD can store up to
8.5 GB, which translates to:


• as much as 141 hours of MP3 audio,

• approximately 1000 photos from a 16-megapixel camera,

• approximately 7.7 minutes of 1080p video at 30 FPS, or

• approximately 2.85 minutes of 4K video at 30 FPS.


https://en.wikipedia.org/wiki/DVD.


The current highest-capacity Ultra HD Blu-ray discs can store up to 100 GB of video.
Streaming a 4K movie can use between 7 and 10 GB per hour (highly compressed).


https://en.wikipedia.org/wiki/Ultra_HD_Blu-ray.


Terabytes (TB) 


One terabyte is about 1000 gigabytes (actually 2⁴⁰ bytes). Recent disk drives for desktop
computers come in sizes up to 15 TB, which is equivalent to:


• approximately 28 years of MP3 audio,

• approximately 1.68 million photos from a 16-megapixel camera,

• approximately 226 hours of 1080p video at 30 FPS and

• approximately 84 hours of 4K video at 30 FPS.


https://www.zdnet.com/article/worlds-biggest-hard-drive-meet-western-digitals-15tb-monster/.


Nimbus Data now has the largest solid-state drive (SSD) at 100 TB, which can store 6.67
times the 15-TB examples of audio, photos and video listed above.


https://www.cinema5d.com/nimbus-data-100tb-ssd-worlds-largest-ssd/.


Petabytes, Exabytes and Zettabytes 


There are nearly four billion people online creating about 2.5 quintillion bytes of data each
day—that’s 2500 petabytes (each petabyte is about 1000 terabytes) or 2.5 exabytes (each
exabyte is about 1000 petabytes). According to a March 2016 Analytics Week article, within
five years there will be over 50 billion devices connected to the Internet (most of them
through the Internet of Things, which we discuss in Sections 1.6.2 and 16.8) and by 2020
we'll be producing 1.7 megabytes of new data every second for every person on the planet.
At today’s numbers (approximately 7.7 billion people), that’s about


• 13 petabytes of new data per second,

• 780 petabytes per minute,

• 46,800 petabytes (46.8 exabytes) per hour and

• 1,123 exabytes per day—that’s 1.123 zettabytes (ZB) per day (each zettabyte is about 1000
exabytes).


https://public.dhe.ibm.com/common/ssi/ecm/wr/en/wrl12345usen/watson-customer-engagement--watson-marketing-wr-other-papers-and-reports-wrl12345usen-20170719.pdf.

https://analyticsweek.com/content/big-data-facts/.

https://en.wikipedia.org/wiki/World_population.
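
The per-second, per-minute, per-hour and per-day figures above follow from simple multiplication. The sketch below reproduces that chain using decimal units; the constant names are our own. Note that the text rounds to 13 petabytes per second before multiplying, so the unrounded values printed here come out slightly larger than the listed ones.

```python
MB_PER_PERSON_PER_SECOND = 1.7  # figure quoted in the text above
WORLD_POPULATION = 7.7e9        # approximate figure quoted in the text above
MB_PER_PB = 10**9               # decimal units: 1 PB = 1,000,000,000 MB

per_second_pb = MB_PER_PERSON_PER_SECOND * WORLD_POPULATION / MB_PER_PB
per_minute_pb = per_second_pb * 60
per_hour_pb = per_minute_pb * 60
per_day_eb = per_hour_pb * 24 / 1000  # 1 EB is about 1000 PB

print(f'{per_second_pb:,.2f} PB per second')
print(f'{per_minute_pb:,.1f} PB per minute')
print(f'{per_hour_pb:,.0f} PB per hour')
print(f'{per_day_eb:,.0f} EB per day')
```

Changing either of the first two constants shows how sensitive the daily total is to the assumed population and per-person data rate.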


That’s the equivalent of over 5.5 million hours (over 600 years) of 4K video every day or
approximately 116 billion photos every day!


Additional Big-Data Stats 


For an entertaining real-time sense of big data, check out
https://www.internetlivestats.com, with various statistics, including the numbers
so far today of


• Google searches.

• Tweets.

• Videos viewed on YouTube.

• Photos uploaded on Instagram.


You can click each statistic to drill down for more information. For instance, they say over 


250 billion tweets were sent in 2018. 


Some other interesting big-data facts: 


• Every hour, YouTube users upload 24,000 hours of video, and almost 1 billion hours of
video are watched on YouTube every day.


• Every second, there are 51,773 GBs (or 51.773 TBs) of Internet traffic, 7894 tweets sent,
64,332 Google searches and 72,029 YouTube videos viewed.


• On Facebook each day there are 800 million “likes,” 60 million emojis are sent, and
there are over two billion searches of the more than 2.5 trillion Facebook posts since the
site’s inception.


https://www.brandwatch.com/blog/youtube-stats/.

http://www.internetlivestats.com/one-second.

https://newsroom.fb.com/news/2017/06/two-billion-people-coming-together-on-facebook.

https://mashable.com/2017/07/17/facebook-world-emoji-day/.

https://techcrunch.com/2016/07/27/facebook-will-make-you-talk/.


• In June 2017, Will Marshall, CEO of Planet, said the company has 142 satellites that
image the whole planet’s land mass once per day. They add one million images and seven
TBs of new data each day. Together with their partners, they’re using machine learning on
that data to improve crop yields, see how many ships are in a given port and track
deforestation. With respect to Amazon deforestation, he said: “Used to be we’d wake up
after a few years and there’s a big hole in the Amazon. Now we can literally count every
tree on the planet every day.”


https://www.bloomberg.com/news/videos/2017-06-30/learning-from-planet-s-shoe-boxed-sized-satellites-video, June 30, 2017.


Domo, Inc. has a nice infographic called “Data Never Sleeps 6.0” showing how much data is
generated every minute, including:


• 473,400 tweets sent.

• 2,083,333 Snapchat photos shared.

• 97,222 hours of Netflix video viewed.

• 12,986,111 text messages sent.

• 49,380 Instagram posts.

• 176,220 Skype calls.

• 750,000 Spotify songs streamed.

• 3,877,140 Google searches.

• 4,333,560 YouTube videos watched.


https://www.domo.com/learn/data-never-sleeps-6.


Computing Power Over the Years 


Data is getting more massive and so is the computing power for processing it. The
performance of today’s processors is often measured in terms of FLOPS (floating-point
operations per second). In the early to mid-1990s, the fastest supercomputer speeds were
measured in gigaflops (10⁹ FLOPS). By the late 1990s, Intel produced the first teraflop (10¹²
FLOPS) supercomputers. In the early-to-mid 2000s, speeds reached hundreds of teraflops,
then in 2008, IBM released the first petaflop (10¹⁵ FLOPS) supercomputer. Currently, the
fastest supercomputer—the IBM Summit, located at the Department of Energy’s (DOE) Oak
Ridge National Laboratory (ORNL)—is capable of 122.3 petaflops.


https://en.wikipedia.org/wiki/FLOPS.
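
To put the prefixes in the preceding paragraph side by side, the following sketch converts a prefixed speed into plain FLOPS. The FLOPS_PREFIXES table and the to_flops function are hypothetical names used only for this illustration; the 122.3 petaflops figure is the one quoted above.

```python
# SI prefixes used for supercomputer speeds, as powers of ten.
FLOPS_PREFIXES = {'giga': 10**9, 'tera': 10**12, 'peta': 10**15, 'exa': 10**18}


def to_flops(value, prefix):
    """Convert a prefixed speed, such as 122.3 'peta', into plain FLOPS."""
    return value * FLOPS_PREFIXES[prefix]


summit = to_flops(122.3, 'peta')
print(f'{summit:.4e} FLOPS')

# How many machines of that speed would together reach one exaflop?
print(f'{to_flops(1, "exa") / summit:.1f} such machines per exaflop')
```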





Distributed computing can link thousands of personal computers via the Internet to produce
even more FLOPS. In late 2016, the Folding@home network—a distributed network in which
people volunteer their personal computers’ resources for use in disease research and drug
design—was capable of over 100 petaflops. Companies like IBM are now working toward
supercomputers capable of exaflops (10¹⁸ FLOPS).


https://en.wikipedia.org/wiki/Folding@home.

https://en.wikipedia.org/wiki/FLOPS.

https://www.ibm.com/blogs/research/2017/06/supercomputing-weather-model-exascale/.


The quantum computers now under development theoretically could operate at
18,000,000,000,000,000,000 times the speed of today’s “conventional computers”! This
number is so extraordinary that in one second, a quantum computer theoretically could do
staggeringly more calculations than the total that have been done by all computers since the
world’s first computer appeared. This almost unimaginable computing power could wreak
havoc with blockchain-based cryptocurrencies like Bitcoin. Engineers are already rethinking
blockchain to prepare for such massive increases in computing power.








https://medium.com/@n.biedrzycki/only-god-can-count-that-fast-the-world-of-quantum-computing-406a0a91lfcf4.
https://singularityhub.com/2017/11/05/is-quantum-computing-an-existential-threat-to-blockchain-technology/.


The history of supercomputing power is that it eventually works its way down from research
labs, where extraordinary amounts of money have been spent to achieve those performance
numbers, into “reasonably priced” commercial computer systems and even desktop
computers, laptops, tablets and smartphones.

Computing power’s cost continues to decline, especially with cloud computing. People used
to ask the question, “How much computing power do I need on my system to deal with my
peak processing needs?” Today, that thinking has shifted to “Can I quickly carve out on the
cloud what I need temporarily for my most demanding computing chores?” You pay for only
what you use to accomplish a given task.


Processing the World’s Data Requires Lots of Electricity 


Data from the world’s Internet-connected devices is exploding, and processing that data
requires tremendous amounts of energy. According to a 2017 article in The Guardian, energy
use for processing data in 2015 was growing at 20% per year and consuming approximately
three to five percent of the world’s power. The article says that total data-processing power
consumption could reach 20% of the world’s electricity by 2025.

https://www.theguardian.com/environment/2017/dec/11/tsunami-of-data-could-consume-fifth-global-electricity-by-2025.


Another enormous electricity consumer is the blockchain-based cryptocurrency Bitcoin.
Processing just one Bitcoin transaction uses approximately the same amount of energy as
powering the average American home for a week! The energy use comes from the process
Bitcoin “miners” use to prove that transaction data is valid.

https://motherboard.vice.com/en_us/article/ywbbpm/bitcoin-mining-electricity-consumption-ethereum-energy-climate-change.


According to some estimates, a year of Bitcoin transactions consumes more energy than
many countries. Together, Bitcoin and Ethereum (another popular blockchain-based
platform and cryptocurrency) consume more energy per year than Israel and almost as much
as Greece.

https://digiconomist.net/bitcoin-energy-consumption.
https://digiconomist.net/ethereum-energy-consumption.


Morgan Stanley predicted in 2018 that “the electricity consumption required to create
cryptocurrencies this year could actually outpace the firm’s projected global electric vehicle
demand—in 2025.” This situation is unsustainable, especially given the huge interest in
blockchain-based applications, even beyond the cryptocurrency explosion. The blockchain
community is working on fixes.

https://www.morganstanley.com/ideas/cryptocurrencies-global-utilities.
https://www.technologyreview.com/s/609480/bitcoin-uses-massive-amounts-of-energybut-theres-a-plan-to-fix-it/.
http://mashable.com/2017/12/01/bitcoin-energy/.





Big-Data Opportunities 


The big-data explosion is likely to continue exponentially for years to come. With 50 billion
computing devices on the horizon, we can only imagine how many more there will be over the
next few decades. It’s crucial for businesses, governments, the military and even individuals
to get a handle on all this data.

It’s interesting that some of the best writings about big data, data science, artificial
intelligence and more are coming out of distinguished business organizations, such as J.P.
Morgan, McKinsey and more. Big data’s appeal to big business is undeniable given the
rapidly accelerating accomplishments. Many companies are making significant investments
and getting valuable results through technologies in this book, such as big data, machine
learning, deep learning and natural-language processing. This is forcing competitors to invest
as well, rapidly increasing the need for computing professionals with data-science and
computer science experience. This growth is likely to continue for many years.


1.7.1 Big Data Analytics


Data analytics is a mature and well-developed academic and professional discipline. The term
“data analysis” was coined in 1962, though people have been analyzing data using statistics
for thousands of years going back to the ancient Egyptians. Big data analytics is a more
recent phenomenon: the term “big data” was coined around 2000.

https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/.
https://www.flydata.com/blog/a-brief-history-of-data-analysis/.
https://bits.blogs.nytimes.com/2013/02/01/the-origins-of-big-data-an-etymological-detective-story/.


Consider four of the V’s of big data:

1. Volume: the amount of data the world is producing is growing exponentially.

2. Velocity: the speed at which that data is being produced, the speed at which it moves
through organizations and the speed at which data changes are all growing quickly.

3. Variety: data used to be alphanumeric (that is, consisting of alphabetic characters, digits,
punctuation and some special characters). Today it also includes images, audio, video
and data from an exploding number of Internet of Things sensors in our homes,
businesses, vehicles, cities and more.

4. Veracity: the validity of the data. Is it complete and accurate? Can we trust that data
when making crucial decisions? Is it real?

https://www.ibmbigdatahub.com/infographic/four-vs-big-data.
There are lots of articles and papers that add many other V-words to this list.
https://www.zdnet.com/article/volume-velocity-and-variety-understanding-the-three-vs-of-big-data/.
https://whatis.techtarget.com/definition/3Vs.
https://www.forbes.com/sites/brentdykes/2017/06/28/big-data-forget-volume-and-variety-focus-on-velocity.


Most data is now being created digitally in a variety of types, in extraordinary volumes and
moving at astonishing velocities. Moore’s Law and related observations have enabled us to
store data economically and to process and move it faster, all at rates growing
exponentially over time. Digital data storage has become so vast in capacity, cheap and small
that we can now conveniently and economically retain all the digital data we’re creating.
That’s big data.

http://www.lesk.com/mlesk/ksg97/ksg.html. [The following article pointed us to
this Michael Lesk article:
https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/.]





The following Richard W. Hamming quote, although from 1962, sets the tone for the rest of
this book:

“The purpose of computing is insight, not numbers.”

Hamming, R. W., Numerical Methods for Scientists and Engineers (New York, NY.,
McGraw Hill, 1962). [The following article pointed us to Hamming’s book and his quote that
we cited:
https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/.]





Data science is producing new, deeper, subtler and more valuable insights at a remarkable
pace. It’s truly making a difference. Big data analytics is an integral part of the answer. We
address big data infrastructure in Chapter 16 with hands-on case studies on NoSQL
databases, Hadoop MapReduce programming, Spark, real-time Internet of Things (IoT)
stream programming and more.

To get a sense of big data’s scope in industry, government and academia, check out the
high-resolution graphic. You can click to zoom for easier readability:

Turck, M., and J. Hao, Great Power, Great Responsibility: The 2018 Big Data & AI
Landscape, http://mattturck.com/bigdata2018/.

http://mattturck.com/wp-content/uploads/2018/07/Matt_ Turck FirstMark Big Data_L








1.7.2 Data Science and Big Data Are Making a Difference: Use Cases


The data-science field is growing rapidly because it’s producing significant results that are
making a difference. We enumerate data-science and big data use cases in the following table.
We expect that the use cases and our examples throughout the book will inspire you to
pursue new use cases in your career. Big-data analytics has resulted in improved profits,
better customer relations, and even sports teams winning more games and championships
while spending less on players.

Sawchik, T., Big Data Baseball: Math, Miracles, and the End of a 20-Year Losing Streak
(New York, Flat Iron Books, 2015).

Ayres, I., Super Crunchers (Bantam Books, 2007), pp. 7–10.

Lewis, M., Moneyball: The Art of Winning an Unfair Game (W. W. Norton & Company,
2004).


Data-science use cases

anomaly detection                   facial recognition                                      predicting weather-sensitive product sales
assisting people with disabilities  fitness tracking                                        predictive analytics
auto-insurance risk prediction      fraud detection                                         preventative medicine
automated closed captioning         game playing                                            preventing disease outbreaks
automated image captions            genomics and healthcare                                 reading sign language
automated investing                 Geographic Information Systems (GIS)                    real-estate valuation
autonomous ships                    GPS Systems                                             recommendation systems
brain mapping                       health outcome improvement                              reducing overbooking
caller identification               hospital readmission reduction                          ride sharing
cancer diagnosis/treatment          human genome sequencing                                 risk minimization
carbon emissions reduction          identity-theft prevention                               robo financial advisors
classifying handwriting             immunotherapy                                           security enhancements
computer vision                     insurance pricing                                       self-driving cars
credit scoring                      intelligent assistants                                  sentiment analysis
crime: predicting locations         Internet of Things (IoT) and medical device monitoring  sharing economy
crime: predicting recidivism        Internet of Things and weather forecasting              similarity detection
crime: predictive policing          inventory control                                       smart cities
crime: prevention                   language translation                                    smart homes
CRISPR gene editing                 location-based services                                 smart meters
crop-yield improvement              loyalty programs                                        smart thermostats
customer churn                      malware detection                                       smart traffic control
customer experience                 mapping                                                 social analytics
customer retention                  marketing                                               social graph analysis
customer satisfaction               marketing analytics                                     spam detection
customer service                    music generation                                        spatial data analysis
customer service agents             natural-language translation                            sports recruiting and coaching
customized diets                    new pharmaceuticals                                     stock market forecasting
cybersecurity                       opioid abuse prevention                                 student performance assessment
data mining                         personal assistants                                     summarizing text
data visualization                  personalized medicine                                   telemedicine
detecting new viruses               personalized shopping                                   terrorist attack prevention
diagnosing breast cancer            phishing elimination                                    theft prevention
diagnosing heart disease            pollution reduction                                     travel recommendations
diagnostic medicine                 precision medicine                                      trend spotting
disaster-victim identification      predicting cancer survival                              visual product search
drones                              predicting disease outbreaks                            voice recognition
dynamic driving routes              predicting health outcomes                              voice search
dynamic pricing                     predicting student enrollments                          weather forecasting
electronic health records
emotion detection
energy-consumption reduction


1.8 CASE STUDY—A BIG-DATA MOBILE APPLICATION 


Google’s Waze GPS navigation app, with its 90 million monthly active users, is one of the
most successful big-data apps. Early GPS navigation devices and apps relied on static maps
and GPS coordinates to determine the best route to your destination. They could not adjust
dynamically to changing traffic situations.

https://www.waze.com/brands/drivers/.

Waze processes massive amounts of crowdsourced data—that is, the data that’s 
continuously supplied by their users and their users’ devices worldwide. They analyze this 
data as it arrives to determine the best route to get you to your destination in the least 
amount of time. To accomplish this, Waze relies on your smartphone’s Internet connection. 
The app automatically sends location updates to their servers (assuming you allow it to). 
They use that data to dynamically re-route you based on current traffic conditions and to 
tune their maps. Users report other information, such as roadblocks, construction, obstacles, 
vehicles in breakdown lanes, police locations, gas prices and more. Waze then alerts other
drivers in those locations.


Waze uses many technologies to provide its services. We’re not privy to how Waze is
implemented, but we infer below a list of technologies they probably use. You’ll use many of
these in Chapters 11–16. For example,

• Most apps created today use at least some open-source software. You’ll take advantage of
many open-source libraries and tools throughout this book.

• Waze communicates information over the Internet between their servers and their users’
mobile devices. Today, such data often is transmitted in JSON (JavaScript Object
Notation) format, which we’ll introduce in Chapter 9 and use in subsequent chapters. The
JSON data is typically hidden from you by the libraries you use.

• Waze uses speech synthesis to speak driving directions and alerts to you, and speech
recognition to understand your spoken commands. We use IBM Watson’s speech-synthesis
and speech-recognition capabilities in Chapter 13.

• Once Waze converts a spoken natural-language command to text, it must determine the
correct action to perform, which requires natural language processing (NLP). We present
NLP in Chapter 11 and use it in several subsequent chapters.

• Waze displays dynamically updated visualizations such as alerts and maps. Waze also
enables you to interact with the maps by moving them or zooming in and out. We create
dynamic visualizations with Matplotlib and Seaborn throughout the book, and we display
interactive maps with Folium in Chapters 12 and 16.

• Waze uses your phone as a streaming Internet of Things (IoT) device. Each phone is a
GPS sensor that continuously streams data over the Internet to Waze. In Chapter 16, we
introduce IoT and work with simulated IoT streaming sensors.

• Waze receives IoT streams from millions of phones at once. It must process, store and
analyze that data immediately to update your device’s maps, to display and speak relevant
alerts and possibly to update your driving directions. This requires massively parallel
processing capabilities implemented with clusters of computers in the cloud. In Chapter
16, we’ll introduce various big-data infrastructure technologies for receiving streaming
data, storing that big data in appropriate databases and processing the data with software
and hardware that provide massively parallel processing capabilities.

• Waze uses artificial-intelligence capabilities to perform the data-analysis tasks that enable
it to predict the best routes based on the information it receives. In Chapters 14 and 15 we
use machine learning and deep learning, respectively, to analyze massive amounts of data
and make predictions based on that data.

• Waze probably stores its routing information in a graph database. Such databases can
efficiently calculate shortest routes. We introduce graph databases, such as Neo4J, in
Chapter 16.

• Many cars are now equipped with devices that enable them to “see” cars and obstacles
around them. These are used, for example, to help implement automated braking systems
and are a key part of self-driving car technology. Rather than relying on users to report
obstacles and stopped cars on the side of the road, navigation apps could take advantage
of cameras and other sensors by using deep-learning computer-vision techniques to
analyze images “on the fly” and automatically report those items. We introduce deep
learning for computer vision in Chapter 15.
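
The JSON item in the list above can be made concrete with the Python Standard Library's json module. The following is only a sketch: the location_update dictionary and all of its field names are hypothetical, invented for illustration, and are not Waze's actual message format, which we are not privy to.

```python
import json

# Hypothetical location update; every field name here is an assumption.
location_update = {
    'user_id': 'driver-123',
    'latitude': 42.3601,
    'longitude': -71.0589,
    'speed_kph': 48.5,
    'report': {'type': 'roadblock', 'confirmed': False},
}

# Serialize the dictionary to JSON text, as an app might before sending it.
payload = json.dumps(location_update)
print(payload)

# Parse the JSON text back into Python objects, as a server might on receipt.
received = json.loads(payload)
print(received['report']['type'])
```

The round trip returns a dictionary equal to the original, which is why libraries can hide the JSON text from you and let you work with ordinary Python objects.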


1.9 INTRO TO DATA SCIENCE: ARTIFICIAL INTELLIGENCE— 
AT THE INTERSECTION OF CS AND DATA SCIENCE 


When a baby first opens its eyes, does it “see” its parents’ faces? Does it understand any
notion of what a face is—or even what a simple shape is? Babies must “learn” the world 
around them. That’s what artificial intelligence (AI) is doing today. It’s looking at massive 
amounts of data and learning from it. AI is being used to play games, implement a wide range 
of computer-vision applications, enable self-driving cars, enable robots to learn to perform 
new tasks, diagnose medical conditions, translate speech to other languages in near real time, 
create chatbots that can respond to arbitrary questions using massive databases of 
knowledge, and much more. Who'd have guessed just a few years ago that artificially 
intelligent self-driving cars would be allowed on our roads—or even become common? Yet, 
this is now a highly competitive area. The ultimate goal of all this learning is artificial
general intelligence, an AI that can perform intelligence tasks as well as humans. This is a
scary thought to many people.


Artificial-Intelligence Milestones 


Several artificial-intelligence milestones, in particular, captured people’s attention and
imagination, made the general public start thinking that AI is real and made businesses think
about commercializing AI:


• In a 1997 match between IBM’s Deep Blue computer system and chess Grandmaster
Garry Kasparov, Deep Blue became the first computer to beat a reigning world chess
champion under tournament conditions. IBM loaded Deep Blue with hundreds of
thousands of grandmaster chess games. Deep Blue was capable of using brute force to
evaluate up to 200 million moves per second! This is big data at work. IBM received the
Carnegie Mellon University Fredkin Prize, which in 1980 offered $100,000 to the creators
of the first computer to beat a world chess champion.

https://en.wikipedia.org/wiki/Deep_Blue_versus_Garry_Kasparov.
https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer).
https://articles.latimes.com/1997/jul/30/news/mn-17696.





• In 2011, IBM’s Watson beat the two best human Jeopardy! players in a $1 million
match. Watson simultaneously used hundreds of language-analysis techniques to locate
correct answers in 200 million pages of content (including all of Wikipedia) requiring
four terabytes of storage. Watson was trained with machine learning and
reinforcement-learning techniques. Chapter 13 discusses IBM Watson and
Chapter 14 discusses machine learning.

https://www.techrepublic.com/article/ibm-watson-the-inside-story-of-how-the-jeopardy-winning-supercomputer-was-born-and-what-it-wants-to-do-next/.
https://en.wikipedia.org/wiki/Watson_(computer).
https://www.aaai.org/Magazine/Watson/watson.php, AI Magazine, Fall 2010.


• Go, a board game created in China thousands of years ago, is widely considered to be
one of the most complex games ever invented with 10^170 possible board configurations.
To give you a sense of how large a number that is, it’s believed that there are (only)
between 10^78 and 10^82 atoms in the known universe! In 2015, AlphaGo, created by
Google’s DeepMind group, used deep learning with two neural networks to beat the
European Go champion Fan Hui. Go is considered to be a far more complex game than
chess. Chapter 15 discusses neural networks and deep learning.

http://www.usgo.org/brief-history-go.
https://www.pbs.org/newshour/science/google-artificial-intelligence-beats-champion-at-worlds-most-complicated-board-game.
https://www.universetoday.com/36302/atoms-in-the-universe/.
https://en.wikipedia.org/wiki/Observable_universe#Matter_content.








• More recently, Google generalized its AlphaGo AI to create AlphaZero, a game-playing
AI that teaches itself to play other games. In December 2017, AlphaZero learned the rules
of and taught itself to play chess in less than four hours using reinforcement learning. It
then beat the world champion chess program, Stockfish 8, in a 100-game match, winning
or drawing every game. After training itself in Go for just eight hours, AlphaZero was able
to play Go vs. its AlphaGo predecessor, winning 60 of 100 games.

https://www.theguardian.com/technology/2017/dec/07/alphazero-google-deepmind-ai-beats-champion-program-teaching-itself-to-play-four-hours.


A Personal Anecdote 


When one of the authors, Harvey Deitel, was an undergraduate student at MIT in the
mid-1960s, he took a graduate-level artificial-intelligence course with Marvin Minsky (to whom
this book is dedicated), one of the founders of artificial intelligence (AI). Harvey:


Professor Minsky required a major term project. He told us to think about what intelligence
is and to make a computer do something intelligent. Our grade in the course would be
almost solely dependent on the project. No pressure!


I researched the standardized IQ tests that schools administer to help evaluate their 
students’ intelligence capabilities. Being a mathematician at heart, I decided to tackle the 
popular IQ-test problem of predicting the next number in a sequence of numbers of 
arbitrary length and complexity. I used interactive Lisp running on an early Digital 
Equipment Corporation PDP-1 and was able to get my sequence predictor running on some 
pretty complex stuff, handling challenges well beyond what I recalled seeing on IQ tests. 
Lisp’s ability to manipulate arbitrarily long lists recursively was exactly what I needed to
meet the project’s requirements. Python offers recursion and generalized list processing
(Chapter 5).


I tried the sequence predictor on many of my MIT classmates. They would make up number 
sequences and type them into my predictor. The PDP-1 would “think” for a while (often a
long while) and almost always came up with the right answer.


Then I hit a snag. One of my classmates typed in the sequence 14, 23, 34 and 42. My 
predictor went to work on it, and the PDP-1 chugged away for a long time, failing to predict 
the next number. I couldn’t get it either. My classmate told me to think about it overnight, 
and he’d reveal the answer the next day, claiming that it was a simple sequence. My efforts
were to no avail.


The following day he told me the next number was 57, but I didn’t understand why. So he
told me to think about it overnight again, and the following day he said the next number
was 125. That didn’t help a bit; I was stumped. He said that the sequence was the numbers
of the two-way crosstown streets of Manhattan. I cried, “foul,” but he said it met my
criterion of predicting the next number in a numerical sequence. My world view was
mathematics; his was broader.


Over the years, I’ve tried that sequence on friends, relatives and professional colleagues. A 
few who spent time in Manhattan got it right. My sequence predictor needed a lot more 
than just mathematical knowledge to handle problems like this, requiring (a possibly vast)
world knowledge.
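
To get a feel for a purely mathematical predictor of the kind the anecdote describes, consider the following minimal Python sketch. It is not Harvey's Lisp program, whose design isn't described here. It assumes one simple technique (recursively extending a table of differences between neighboring terms), and the function name predict_next is hypothetical.

```python
def predict_next(sequence):
    """Predict the next term by recursively extending the difference table."""
    if len(sequence) == 0:
        raise ValueError('sequence must contain at least one number')

    # Base case: a constant sequence (or a single value) simply repeats.
    if all(value == sequence[0] for value in sequence):
        return sequence[0]

    # Recursive case: predict the next difference, then add it to the last term.
    differences = [b - a for a, b in zip(sequence, sequence[1:])]
    return sequence[-1] + predict_next(differences)


print(predict_next([2, 4, 6, 8]))       # arithmetic sequence
print(predict_next([1, 4, 9, 16]))      # perfect squares
print(predict_next([14, 23, 34, 42]))   # the Manhattan street sequence
```

This approach handles the arithmetic and perfect-square sequences (yielding 10 and 25), but for the street numbers it produces 42 rather than 57. No amount of arithmetic on the four given values reveals a pattern that lives in a city map, which is exactly the point of the anecdote.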


Watson and Big Data Open New Possibilities 

When Paul and I started working on this Python book, we were immediately drawn to 
IBM’s Watson using big data and artificial-intelligence techniques like natural language 
processing (NLP) and machine learning to beat two of the world’s best human Jeopardy! 
players. We realized that Watson could probably handle problems like the sequence
predictor because it was loaded with the world’s street maps and a whole lot more. That
whet our appetite for digging in deep on big data and today’s artificial-intelligence
technologies, and helped shape Chapters 11–16 of this book.


It’s notable that all of the data-science implementation case studies in Chapters 11–16 either
are rooted in artificial intelligence technologies or discuss the big data hardware and software
infrastructure that enables computer scientists and data scientists to implement leading-edge
AI-based solutions effectively.


AI: A Field with Problems But No Solutions


For many decades, AI has been viewed as a field with problems but no solutions. That’s
because once a particular problem is solved people say, “Well, that’s not intelligence, it’s just
a computer program that tells the computer exactly what to do.” However, with machine
learning (Chapter 14) and deep learning (Chapter 15) we’re not pre-programming solutions
to specific problems. Instead, we’re letting our computers solve problems by learning from
data and, typically, lots of it.


Many of the most interesting and challenging problems are being pursued with deep
learning. Google alone has thousands of deep-learning projects underway and that number is
growing quickly. As you work through this book, we’ll introduce you to many
edge-of-the-practice artificial intelligence, big data and cloud technologies.

http://theweek.com/speedreads/654463/google-more-than-1000-artificial-intelligence-projects-works.
https://www.zdnet.com/article/google-says-exponential-growth-of-ai-is-changing-nature-of-compute/.


1.10 WRAP-UP 


In this chapter, we introduced terminology and concepts that lay the groundwork for the
Python programming you’ll learn in Chapters 2–10 and the big-data, artificial-intelligence
and cloud-based case studies we present in Chapters 11–16.


We reviewed object-oriented programming concepts and discussed why Python has become 
so popular. We introduced the Python Standard Library and various data-science libraries 
that help you avoid “reinventing the wheel.” In subsequent chapters, you'll use these libraries 
to create software objects that you’ll interact with to perform significant tasks with modest
numbers of instructions.


You worked through three test-drives showing how to execute Python code with the IPython
interpreter and Jupyter Notebooks. We introduced the Cloud and the Internet of Things
(IoT), laying the groundwork for the contemporary applications you’ll develop in Chapters
11–16.


We discussed just how big “big data” is and how quickly it’s getting even bigger, and
presented a big-data case study on the Waze mobile navigation app, which uses many current
technologies to provide dynamic driving directions that get you to your destination as quickly
and as safely as possible. We mentioned where in this book you’ll use many of those
technologies. The chapter closed with our first Intro to Data Science section in which we
discussed a key intersection between computer science and data science: artificial
intelligence.




2. Introduction to Python Programming 


Objectives 
In this chapter, you'll: 


• Continue using IPython interactive mode to enter code snippets and see their results
immediately.
• Write simple Python statements and scripts.
• Create variables to store data for later use.
• Become familiar with built-in data types.
• Use arithmetic operators and comparison operators, and understand their
precedence.
• Use single-, double- and triple-quoted strings.
• Use built-in function print to display text.
• Use built-in function input to prompt the user to enter data at the keyboard and get
that data for use in the program.
• Convert text to integer values with built-in function int.
• Use comparison operators and the if statement to decide whether to execute a
statement or group of statements.
• Learn about objects and Python’s dynamic typing.
• Use built-in function type to get an object’s type.


Outline 


2.1 Introduction
2.2 Variables and Assignment Statements
2.3 Arithmetic
2.4 Function print and an Intro to Single- and Double-Quoted Strings
2.5 Triple-Quoted Strings
2.6 Getting Input from the User
2.7 Decision Making: The if Statement and Comparison Operators
2.8 Objects and Dynamic Typing
2.9 Intro to Data Science: Basic Descriptive Statistics
2.10 Wrap-Up


2.1 INTRODUCTION 


In this chapter, we introduce Python programming and present examples illustrating
key language features. We assume you’ve read the IPython Test-Drive in Chapter 1,
which introduced the IPython interpreter and used it to evaluate simple arithmetic
expressions.


2.2 VARIABLES AND ASSIGNMENT STATEMENTS 


You've used IPython’s interactive mode as a calculator with expressions such as 


In [1]: 45 + 72
Out[1]: 117


Let’s create a variable named x that stores the integer 7:

In [2]: x = 7


Snippet [2] is a statement. Each statement specifies a task to perform. The preceding
statement creates x and uses the assignment symbol (=) to give x a value. Most
statements stop at the end of the line, though it’s possible for statements to span more
than one line. The following statement creates the variable y and assigns to it the value
3:

In [3]: y = 3


You can now use the values of x and y in expressions: 


In [4]: x + y
Out[4]: 10


Calculations in Assignment Statements 


The following statement adds the values of variables x and y and assigns the result to
the variable total, which we then display:

In [5]: total = x + y

In [6]: total
Out[6]: 10


The = symbol is not an operator. The right side of the = symbol always executes first, 
then the result is assigned to the variable on the symbol’s left side. 


Python Style 
The Style Guide for Python Code helps you write code that conforms to Python’s
coding conventions. The style guide recommends inserting one space on each side of
the assignment symbol = and binary operators like + to make programs more readable.

https://www.python.org/dev/peps/pep-0008/.


Variable Names 


A variable name, such as x, is an identifier. Each identifier may consist of letters, 
digits and underscores (_) but may not begin with a digit. Python is case sensitive, so 
number and Number are different identifiers because one begins with a lowercase letter
and the other begins with an uppercase letter.
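
You can check these naming rules from Python itself. The following sketch uses the string method isidentifier together with the Standard Library's keyword module. The helper name is_valid_variable_name is hypothetical, and the check for reserved words such as class goes slightly beyond the rules stated above.

```python
import keyword


def is_valid_variable_name(name):
    """Return True if name can be used as a variable name."""
    # str.isidentifier applies the letters/digits/underscores rule;
    # keyword.iskeyword rejects reserved words such as 'class'.
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ['x', 'total_2', '_count', '2fast', 'my-name', 'class']:
    print(f'{candidate!r}: {is_valid_variable_name(candidate)}')

# Case sensitivity: these are two distinct variables.
number = 7
Number = 10.5
print(number, Number)
```

The first three candidates are accepted. '2fast' begins with a digit, 'my-name' contains a hyphen and 'class' is a Python keyword, so all three are rejected.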


Types 


Each value in Python has a type that indicates the kind of data the value represents.
You can view a value’s type with Python’s built-in type function, as in:

In [7]: type(x)
Out[7]: int

In [8]: type(10.5)
Out[8]: float

The variable x contains the integer value 7 (from snippet [2]), so Python displays int
(short for integer). The value 10.5 is a floating-point number, so Python displays
float.


2.3 ARITHMETIC 


The following table summarizes the arithmetic operators, which include some
symbols not used in algebra.

Python operation     Arithmetic operator   Algebraic expression   Python expression
Addition             +                     f + 7                  f + 7
Subtraction          -                     p - c                  p - c
Multiplication       *                     b · m                  b * m
Exponentiation       **                    x^y                    x ** y
True division        /                     x/y or x ÷ y           x / y
Floor division       //                    ⌊x/y⌋ or ⌊x ÷ y⌋       x // y
Remainder (modulo)   %                     r mod s                r % s


Multiplication (*) 


Python uses the asterisk (*) multiplication operator: 


In [1]: 7 * 4
Out[1]: 28


Exponentiation (**) 


The exponentiation (**) operator raises one value to the power of another: 


In [2]: 2 ** 10
Out[2]: 1024


To calculate the square root, you can use the exponent 1/2 (that is, 0.5):


In [3]: 9 ** (1 / 2)
Out[3]: 3.0


True Division (/) vs. Floor Division (//)


True division (/) divides a numerator by a denominator and yields a floating-point 


number with a decimal point, as in: 


In [4]: 7 / 4
Out[4]: 1.75


Floor division (//) divides a numerator by a denominator, yielding the highest 
integer that’s not greater than the result. Python truncates (discards) the fractional 


part: 
In [5]: 7 // 4
Out[5]: 1

In [6]: 3 // 5
Out[6]: 0

In [7]: 14 // 7
Out[7]: 2


In true division, -13 divided by 4 gives -3.25: 


In [8]: -13 / 4
Out[8]: -3.25


Floor division gives the closest integer that’s not greater than -3.25—which is -4:


In [9]: -13 // 4
Out[9]: -4
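

The difference between flooring and simply discarding the fractional part matters only for negative results. The following sketch contrasts the // operator with the built-in int function and the standard library's math.floor and math.trunc; the variable names are our own.

```python
import math

# Floor division rounds toward negative infinity.
floored = -13 // 4
print(floored)              # -4
print(math.floor(-13 / 4))  # -4

# int() and math.trunc() discard the fraction, rounding toward zero.
truncated = int(-13 / 4)
print(truncated)            # -3
print(math.trunc(-13 / 4))  # -3

# For positive operands the two approaches agree.
print(13 // 4, int(13 / 4))  # 3 3
```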


Exceptions and Tracebacks 


Dividing by zero with / or // is not allowed and results in an exception—a sign that a 


problem occurred: 


In [10]: 123 / 0
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-10-cd759d3fcf39> in <module>()
----> 1 123 / 0

ZeroDivisionError: division by zero











Python reports an exception with a traceback. This traceback indicates that an 
exception of type ZeroDivisionError occurred—most exception names end with 


Error. In interactive mode, the snippet number that caused the exception is specified 


by the 10 in the line 


<ipython-input-10-cd759d3fcf39> in <module>() 


The line that begins with ----> shows the code that caused the exception. Sometimes 
snippets have more than one line of code—the 1 to the right of ----> indicates that 
line 1 within the snippet caused the exception. The last line shows the exception that 
occurred, followed by a colon (:) and an error message with more information about 


the exception: 





ZeroDivisionError: division by zero 


The “Files and Exceptions” chapter discusses exceptions in detail. 


An exception also occurs if you try to use a variable that you have not yet created. The 


following snippet tries to add 7 to the undefined variable z, resulting in a NameError: 


In [11]: z + 7
NameError                                 Traceback (most recent call last)
<ipython-input-11-f2cdbf4fe75d> in <module>()
----> 1 z + 7

NameError: name 'z' is not defined











Remainder Operator 


Python’s remainder operator (%) yields the remainder after the left operand is 


divided by the right operand: 


In [12]: 17 % 5
Out[12]: 2


In this case, 17 divided by 5 yields a quotient of 3 and a remainder of 2. This operator 


is most commonly used with integers, but also can be used with other numeric types: 


In [13]: 7.5 % 3.5
Out[13]: 0.5
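

Floor division and the remainder operator are linked: the quotient times the divisor, plus the remainder, gives back the original value. A minimal sketch using the built-in divmod function, with variable names chosen for illustration:

```python
# Quotient and remainder satisfy: a == (a // b) * b + (a % b)
a, b = 17, 5
quotient, remainder = divmod(a, b)    # same as (a // b, a % b)
print(quotient, remainder)            # 3 2
print(quotient * b + remainder == a)  # True

# With a negative left operand, floor division rounds down, so the
# remainder takes the sign of the right operand.
print(-17 // 5, -17 % 5)  # -4 3
```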


Straight-Line Form 


Algebraic notations such as 


 a
---
 b


generally are not acceptable to compilers or interpreters. For this reason, algebraic
expressions must be typed in straight-line form using Python’s operators. The
expression above must be written as a / b (or a // b for floor division) so that all
operators and operands appear in a horizontal straight line.


Grouping Expressions with Parentheses 


Parentheses group Python expressions, as they do in algebraic expressions. For 


example, the following code multiplies 10 times the quantity 5 + 3: 


In [14]: 10 * (5 + 3)
Out[14]: 80


Without these parentheses, the result is different: 


In [15]: 10 * 5 + 3
Out[15]: 53


The parentheses are redundant (unnecessary) if removing them yields the same 


result. 


Operator Precedence Rules 


Python applies the operators in arithmetic expressions according to the following rules 


of operator precedence. These are generally the same as those in algebra: 


1. Expressions in parentheses evaluate first, so parentheses may force the order of 
evaluation to occur in any sequence you desire. Parentheses have the highest level of 
precedence. In expressions with nested parentheses, such as (a / (b - c)),
the expression in the innermost parentheses (that is, b - c) evaluates first.


2. Exponentiation operations evaluate next. If an expression contains several 


exponentiation operations, Python applies them from right to left. 


3. Multiplication, division and modulus operations evaluate next. If an expression 


contains several multiplication, true-division, floor-division and modulus 
operations, Python applies them from left to right. Multiplication, division and 


modulus are “on the same level of precedence.” 


4. Addition and subtraction operations evaluate last. If an expression contains several 
addition and subtraction operations, Python applies them from left to right. 


Addition and subtraction also have the same level of precedence. 
For the complete list of operators and their precedence (in lowest-to-highest order), see
https://docs.python.org/3/reference/expressions.html#operator-precedence


Operator Grouping 


When we say that Python applies certain operators from left to right, we are referring to 


the operators’ grouping. For example, in the expression 

a + b + c

the addition operators (+) group from left to right as if we parenthesized the expression
as (a + b) + c. All Python operators of the same precedence group left-to-right
except for the exponentiation operator (**), which groups right-to-left.
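

Grouping changes results whenever an operator is not associative. The sketch below, with illustrative variable names, shows right-to-left grouping for ** and left-to-right grouping for /.

```python
# Exponentiation groups right to left.
right_grouped = 2 ** 3 ** 2   # same as 2 ** (3 ** 2)
left_grouped = (2 ** 3) ** 2
print(right_grouped, left_grouped)  # 512 64

# Division groups left to right.
print(100 / 10 / 5)    # 2.0, same as (100 / 10) / 5
print(100 / (10 / 5))  # 50.0
```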


Redundant Parentheses 


You can use redundant parentheses to group subexpressions to make the expression 


clearer. For example, the second-degree polynomial 


y = a * x ** 2 + b * x + c


can be parenthesized, for clarity, as


y = (a * (x ** 2)) + (b * x) + c


Breaking a complex expression into a sequence of statements with shorter, simpler 


expressions also can promote clarity. 


Operand Types 


Each arithmetic operator may be used with integers and floating-point numbers. If both 
operands are integers, the result is an integer—except for the true-division (/) operator, 
which always yields a floating-point number. If both operands are floating-point 
numbers, the result is a floating-point number. Expressions containing an integer and a 
floating-point number are mixed-type expressions—these always produce floating- 


point results. 
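

These rules can be confirmed with the built-in type function. The sketch also shows one further case worth knowing: an integer raised to a negative integer power yields a floating-point number.

```python
print(type(7 + 3))     # <class 'int'>
print(type(7 // 2))    # <class 'int'>
print(7 / 7)           # 1.0 (true division always yields a float)
print(7.0 // 2)        # 3.0 (floor division of a float yields a float)
print(type(7 * 2.5))   # <class 'float'> (mixed-type expression)
print(2 ** -1)         # 0.5 (int operands, float result)
```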


2.4 FUNCTION PRINT AND AN INTRO TO SINGLE- AND 
DOUBLE-QUOTED STRINGS 


The built-in print function displays its argument(s) as a line of text: 


In [1]: print('Welcome to Python!')
Welcome to Python!


In this case, the argument 'Welcome to Python!' is a string—a sequence of
characters enclosed in single quotes ('). Unlike when you evaluate expressions in 
interactive mode, the text that print displays here is not preceded by Out [1]. Also, 
print does not display a string’s quotes, though we’ll soon show how to display quotes 


in strings. 
You also may enclose a string in double quotes ("), as in: 


In [2]: print("Welcome to Python!")
Welcome to Python! 


Python programmers generally prefer single quotes. When print completes its task, it 


positions the screen cursor at the beginning of the next line. 


Printing a Comma-Separated List of Items 


The print function can receive a comma-separated list of arguments, as in: 


In [3]: print('Welcome', 'to', 'Python!')
Welcome to Python!


It displays each argument separated from the next by a space, producing the same
output as in the two preceding snippets. Here we showed a comma-separated list of 
strings, but the values can be of any type. We'll show in the next chapter how to prevent 


automatic spacing between values or use a different separator than space. 


Printing Many Lines of Text with One Statement 


When a backslash (\) appears in a string, it’s known as the escape character. The 
backslash and the character immediately following it form an escape sequence. For 
example, \n represents the newline character escape sequence, which tells print to 
move the output cursor to the next line. The following snippet uses three newline 


characters to create several lines of output:


In [4]: print('Welcome\nto\n\nPython!')
Welcome
to

Python!


Other Escape Sequences 


The following table shows some common escape sequences. 


Escape sequence   Description
\n                Insert a newline character in a string. When the string is displayed, for each newline, move the screen cursor to the beginning of the next line.
\t                Insert a horizontal tab. When the string is displayed, for each tab, move the screen cursor to the next tab stop.
\\                Insert a backslash character in a string.
\"                Insert a double quote character in a string.
\'                Insert a single quote character in a string.
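

A short sketch of these escape sequences in use. The sample strings are our own; the point is that each two-character escape sequence is stored as a single character.

```python
# Each escape sequence is stored as a single character.
print(len('\n'), len('\t'), len('\\'))  # 1 1 1

print('Name:\tPaul')      # a tab separates the label from the value
print('C:\\temp')         # C:\temp
print("She said \"hi\"")  # She said "hi"
print('It\'s here')       # It's here
```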


Ignoring a Line Break in a Long String 


You may also split a long string (or a long statement) over several lines by using the \ 


continuation character as the last character on a line to ignore the line break: 


In [5]: print('this is a longer string, so we \
   ...: split it over two lines')
this is a longer string, so we split it over two lines


The interpreter reassembles the string’s parts into a single string with no line break. 
Though the backslash character in the preceding snippet is inside a string, it’s not the 
escape character because another character does not follow it. 


Printing the Value of an Expression 


Calculations can be performed in print statements: 


In [6]: print('Sum is', 7 + 3)
Sum is 10


2.5 TRIPLE-QUOTED STRINGS 


Earlier, we introduced strings delimited by a pair of single quotes (') or a pair of double 
quotes ("). Triple-quoted strings begin and end with three double quotes (""") or 
three single quotes ('''). The Style Guide for Python Code recommends three double 


quotes ("""). Use these to create: 


• multiline strings,
• strings containing single or double quotes and
• docstrings, which are the recommended way to document the purposes of certain
program components.


Including Quotes in Strings 


In a string delimited by single quotes, you may include double-quote characters: 


In [1]: print('Display "hi" in quotes')
Display "hi" in quotes





but not single quotes: 


In [2]: print('Display 'hi' in quotes')
  File "<ipython-input-2-19bf596ccf72>", line 1
    print('Display 'hi' in quotes')
                     ^
SyntaxError: invalid syntax





unless you use the \' escape sequence: 


In [3]: print('Display \'hi\' in quotes')
Display 'hi' in quotes





Snippet [2] displayed a syntax error due to a single quote inside a single-quoted string.
IPython displays information about the line of code that caused the syntax error and
points to the error with a ^ symbol. It also displays the message SyntaxError:
invalid syntax.


A string delimited by double quotes may include single quote characters:


In [4]: print("Display the name O'Brien")
Display the name O'Brien


but not double quotes, unless you use the \" escape sequence: 


In [5]: print("Display \"hi\" in quotes")
Display "hi" in quotes





To avoid using \' and \" inside strings, you can enclose such strings in triple quotes:


In [6]: print("""Display "hi" and 'bye' in quotes""")
Display "hi" and 'bye' in quotes


Multiline Strings 


The following snippet assigns a multiline triple-quoted string to
triple_quoted_string:


In [7]: triple_quoted_string = """This is a triple-quoted
   ...: string that spans two lines"""


IPython knows that the string is incomplete because we did not type the closing """
before we pressed Enter. So, IPython displays a continuation prompt ...: at which
you can input the multiline string’s next line. This continues until you enter the ending
""" and press Enter. The following displays triple_quoted_string:


In [8]: print(triple_quoted_string)
This is a triple-quoted
string that spans two lines


Python stores multiline strings with embedded newline characters. When we evaluate
triple_quoted_string rather than printing it, IPython displays it in single quotes
with a \n character where you pressed Enter in snippet [7]. The quotes IPython
displays indicate that triple_quoted_string is a string—they’re not part of the
string’s contents:


In [9]: triple_quoted_string
Out[9]: 'This is a triple-quoted\nstring that spans two lines'
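

The embedded newline can also be inspected in a script. This sketch uses the variable name text (our choice) and the string methods count and splitlines.

```python
text = """This is a triple-quoted
string that spans two lines"""

print('\n' in text)       # True
print(text.count('\n'))   # 1
print(text.splitlines())  # ['This is a triple-quoted', 'string that spans two lines']
```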


2.6 GETTING INPUT FROM THE USER 


The built-in input function requests and obtains user input: 


In [1]: name = input("What's your name? ")
What's your name? Paul

In [2]: name
Out[2]: 'Paul'

In [3]: print(name)
Paul


The snippet executes as follows: 


• First, input displays its string argument—a prompt—to tell the user what to type
and waits for the user to respond. We typed Paul and pressed Enter. We use bold
text to distinguish the user’s input from the prompt text that input displays.

• Function input then returns those characters as a string that the program can use.
Here we assigned that string to the variable name.


Snippet [2] shows name’s value. Evaluating name displays its value in single quotes as
'Paul' because it’s a string. Printing name (in snippet [3]) displays the string without
the quotes. If you enter quotes, they’re part of the string, as in:


In [4]: name = input("What's your name? ")
What's your name? 'Paul'

In [5]: name
Out[5]: "'Paul'"

In [6]: print(name)
'Paul'


Function input Always Returns a String 


Consider the following snippets that attempt to read two numbers and add them: 


In [7]: value1 = input('Enter first number: ')
Enter first number: 7

In [8]: value2 = input('Enter second number: ')
Enter second number: 3

In [9]: value1 + value2
Out[9]: '73'


Rather than adding the integers 7 and 3 to produce 10, Python “adds” the string values 
'7' and '3', producing the string '73'. This is known as string concatenation. It 
creates a new string containing the left operand’s value followed by the right operand’s 


value. 


Getting an Integer from the User 


If you need an integer, convert the string to an integer using the built-in int function: 


In [10]: value = input('Enter an integer: ')
Enter an integer: 7

In [11]: value = int(value)

In [12]: value
Out[12]: 7


We could have combined the code in snippets [10] and [11]: 


In [13]: another_value = int(input('Enter another integer: '))
Enter another integer: 13

In [14]: another_value
Out[14]: 13


Variables value and another_value now contain integers. Adding them produces an
integer result (rather than concatenating them):


In [15]: value + another_value
Out[15]: 20


If the string passed to int cannot be converted to an integer, a ValueError occurs: 


In [16]: bad_value = int(input('Enter another integer: '))
Enter another integer: hello
ValueError                                Traceback (most recent call last)
<ipython-input-16-cd36e6cf8911> in <module>()
----> 1 bad_value = int(input('Enter another integer: '))

ValueError: invalid literal for int() with base 10: 'hello'

















Function int also can convert a floating-point value to an integer: 


In [17]: int(10.5)
Out[17]: 10


To convert strings to floating-point numbers, use the built-in float function. 
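

The conversions in this section can be summarized in a sketch that runs without prompting. The variables first and second are hypothetical stand-ins for the strings that input would return.

```python
# Strings standing in for what input would return, so the sketch runs unattended.
first = '7'
second = '3'

print(first + second)            # 73 (string concatenation)
print(int(first) + int(second))  # 10 (integer addition)

print(float('10.5'))  # 10.5
print(int(10.5))      # 10
print(int(-10.5))     # -10 (int discards the fraction rather than rounding down)

# int cannot parse a string containing a decimal point, so convert via float first.
print(int(float('10.5')))  # 10
```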


2.7 DECISION MAKING: THE IF STATEMENT AND 
COMPARISON OPERATORS 


A condition is a Boolean expression with the value True or False. The following 


determines whether 7 is greater than 4 and whether 7 is less than 4: 


In [1]: 7 > 4
Out[1]: True

In [2]: 7 < 4
Out[2]: False


True and False are Python keywords. Using a keyword as an identifier causes a
SyntaxError. True and False are each capitalized.


You'll often create conditions using the comparison operators in the following 
table: 


Algebraic operator   Python operator   Sample condition   Meaning
>                    >                 x > y              x is greater than y
<                    <                 x < y              x is less than y
≥                    >=                x >= y             x is greater than or equal to y
≤                    <=                x <= y             x is less than or equal to y
=                    ==                x == y             x is equal to y
≠                    !=                x != y             x is not equal to y


Operators >, <, >= and <= all have the same precedence. Operators == and != both 
have the same precedence, which is lower than that of >, <, >= and <=. A syntax error 
occurs when any of the operators ==, !=, >= and <= contains spaces between its pair of
symbols:


In [3]: 7 > = 4
  File "<ipython-input-3-5c6e2897f3b3>", line 1
    7 > = 4
        ^
SyntaxError: invalid syntax


Another syntax error occurs if you reverse the symbols in the operators !=, >= and <=
(by writing them as =!, => and =<).


Making Decisions with the if Statement: Introducing Scripts


We now present a simple version of the if statement, which uses a condition to
decide whether to execute a statement (or a group of statements). Here we'll read two
integers from the user and compare them using six consecutive if statements, one for
each comparison operator. If the condition in a given if statement is True, the
corresponding print statement executes; otherwise, it’s skipped.


IPython interactive mode is helpful for executing brief code snippets and seeing
immediate results. When you have many statements to execute as a group, you typically
write them as a script stored in a file with the .py (short for Python) extension—such
as fig02_01.py for this example’s script. Scripts are also called programs. For
instructions on locating and executing the scripts in this book, see Chapter 1’s IPython
Test-Drive.


Each time you execute this script, three of the six conditions are True. To show this, we
execute the script three times—once with the first integer less than the second, once
with the same value for both integers and once with the first integer greater than the
second. The three sample executions appear after the script.


Each time we present a script like the one below, we introduce it before the figure, then 
explain the script’s code after the figure. We show line numbers for your convenience— 
these are not part of Python. IDEs enable you to choose whether to display line 
numbers. To run this example, change to this chapter’s ch02 examples folder, then 


enter: 


ipython fig02_01.py


or, if you’re in IPython already, you can use the command:


run fig02_01.py


 1  # fig02_01.py
 2  """Comparing integers using if statements and comparison operators."""
 3
 4  print('Enter two integers, and I will tell you',
 5        'the relationships they satisfy.')
 6
 7  # read first integer
 8  number1 = int(input('Enter first integer: '))
 9
10  # read second integer
11  number2 = int(input('Enter second integer: '))
12
13  if number1 == number2:
14      print(number1, 'is equal to', number2)
15
16  if number1 != number2:
17      print(number1, 'is not equal to', number2)
18
19  if number1 < number2:
20      print(number1, 'is less than', number2)
21
22  if number1 > number2:
23      print(number1, 'is greater than', number2)
24
25  if number1 <= number2:
26      print(number1, 'is less than or equal to', number2)
27
28  if number1 >= number2:
29      print(number1, 'is greater than or equal to', number2)









Enter two integers, and I will tell you the relationships they satisfy.
Enter first integer: 37
Enter second integer: 42
37 is not equal to 42
37 is less than 42
37 is less than or equal to 42


Enter two integers, and I will tell you the relationships they satisfy.
Enter first integer: 7
Enter second integer: 7
7 is equal to 7
7 is less than or equal to 7
7 is greater than or equal to 7


Enter two integers, and I will tell you the relationships they satisfy.
Enter first integer: 54
Enter second integer: 17
54 is not equal to 17
54 is greater than 17
54 is greater than or equal to 17











Comments 


Line 1 begins with the hash character (#), which indicates that the rest of the line is a 


comment: 


# fig02_01.py


For easy reference, we begin each script with a comment indicating the script’s file 
name. A comment also can begin to the right of the code on a given line and continue 
until the end of that line. 


Docstrings 


The Style Guide for Python Code states that each script should start with a docstring 


that explains the script’s purpose, such as the one in line 2:


"""Comparing integers using if statements and comparison operators."""


For more complex scripts, the docstring often spans many lines. In later chapters, you'll
use docstrings to describe script components you define, such as new functions and
new types called classes. We'll also discuss how to access docstrings with the IPython
help mechanism.


Blank Lines 


Line 3 is a blank line. You use blank lines and space characters to make code easier to 
read. Together, blank lines, space characters and tab characters are known as white 


space. Python ignores most white space—you'll see that some indentation is required. 


Splitting a Lengthy Statement Across Lines 


Lines 4–5


print('Enter two integers, and I will tell you',
      'the relationships they satisfy.')


display instructions to the user. These are too long to fit on one line, so we broke them
into two strings. Recall that you can display several values by passing to print a
comma-separated list—print separates each value from the next with a space.


Typically, you write statements on one line. You may spread a lengthy statement over 
several lines with the \ continuation character. Python also allows you to split long 
code lines in parentheses without using continuation characters (as in lines 4–5). This
is the preferred way to break long code lines according to the Style Guide for Python 
Code. Always choose breaking points that make sense, such as after a comma in the 


preceding call to print or before an operator in a lengthy expression. 


Reading Integer Values from the User 


Next, lines 8 and 11 use the built-in input and int functions to prompt for and read 


two integer values from the user. 


if Statements 


The if statement in lines 13–14


if number1 == number2:
    print(number1, 'is equal to', number2)


uses the == comparison operator to determine whether the values of variables
number1 and number2 are equal. If so, the condition is True, and line 14 displays a
line of text indicating that the values are equal. If any of the remaining if statements’
conditions are True (lines 16, 19, 22, 25 and 28), the corresponding print displays a
line of text.


Each if statement consists of the keyword if, the condition to test, and a colon (:) 
followed by an indented body called a suite. Each suite must contain one or more 


statements. Forgetting the colon (:) after the condition is a common syntax error. 
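

To experiment with the six if statements without typing input each time, the same logic can be wrapped in a function. This is only a sketch: the function name relationships is hypothetical, and functions and lists are covered in later chapters.

```python
def relationships(number1, number2):
    """Return the descriptions of the comparisons that are True."""
    results = []
    if number1 == number2:
        results.append('is equal to')
    if number1 != number2:
        results.append('is not equal to')
    if number1 < number2:
        results.append('is less than')
    if number1 > number2:
        results.append('is greater than')
    if number1 <= number2:
        results.append('is less than or equal to')
    if number1 >= number2:
        results.append('is greater than or equal to')
    return results


# Exactly three of the six conditions hold for any pair of integers.
for pair in [(37, 42), (7, 7), (54, 17)]:
    print(pair, relationships(*pair))
```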


Suite Indentation 


Python requires you to indent the statements in suites. The Style Guide for Python 
Code recommends four-space indents—we use that convention throughout this book. 


You'll see in the next chapter that incorrect indentation can cause errors. 


Confusing == and = 


Using the assignment symbol (=) instead of the equality operator (==) in an if 
statement’s condition is a common syntax error. To help avoid this, read == as “is equal
to” and = as “is assigned.” You'll see in the next chapter that using == in place of = in an 


assignment statement can lead to subtle problems. 


Chaining Comparisons 


You can chain comparisons to check whether a value is in a range. The following 


comparison determines whether x is in the range 1 through 5, inclusive: 


In [1]: x = 3

In [2]: 1 <= x <= 5
Out[2]: True

In [3]: x = 10

In [4]: 1 <= x <= 5
Out[4]: False
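

A chained comparison behaves like two comparisons joined by the Boolean operator and (introduced in the next chapter). It is not the same as parenthesizing the first comparison, as this sketch shows; the variable names chained and expanded are our own.

```python
x = 3
chained = 1 <= x <= 5
expanded = 1 <= x and x <= 5
print(chained, expanded)  # True True

x = 10
print(1 <= x <= 5)    # False
# Parenthesizing compares the Boolean result True with 5, which is misleading.
print((1 <= x) <= 5)  # True
```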





Precedence of the Operators We’ve Presented So Far 


The precedence of the operators introduced in this chapter is shown below: 


Operators      Grouping        Type
()             left to right   parentheses
**             right to left   exponentiation
*  /  //  %    left to right   multiplication, true division, floor division, remainder
+  -           left to right   addition, subtraction
<  <=  >  >=   left to right   less than, less than or equal, greater than, greater than or equal
==  !=         left to right   equal, not equal


The table lists the operators top-to-bottom in decreasing order of precedence. When 
writing expressions containing multiple operators, confirm that they evaluate in the 


order you expect by referring to the operator precedence chart at 


https://docs.python.org/3/reference/expressions.html#operator-precedence


2.8 OBJECTS AND DYNAMIC TYPING 


Values such as 7 (an integer), 4.1 (a floating-point number) and 'dog' are all objects.
Every object has a type and a value:


In [1]: type(7)
Out[1]: int

In [2]: type(4.1)
Out[2]: float

In [3]: type('dog')
Out[3]: str


An object’s value is the data stored in the object. The snippets above show objects of 
built-in types int (for integers), float (for floating-point numbers) and str (for 


strings). 


Variables Refer to Objects 


Assigning an object to a variable binds (associates) that variable’s name to the object. 


As you've seen, you can then use the variable in your code to access the object’s value: 


In [4]: x = 7

In [5]: x + 10
Out[5]: 17

In [6]: x
Out[6]: 7


After snippet [4]’s assignment, the variable x refers to the integer object containing 
7. As shown in snippet [6], snippet [5] does not change x’s value. You can change x 


as follows: 


In [7]: x = x + 10

In [8]: x
Out[8]: 17


Dynamic Typing


Python uses dynamic typing—it determines the type of the object a variable refers to 
while executing your code. We can show this by rebinding the variable x to different 


objects and checking their types: 
In [9]: type(x)
Out[9]: int

In [10]: x = 4.1

In [11]: type(x)
Out[11]: float

In [12]: x = 'dog'

In [13]: type(x)
Out[13]: str
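

The following sketch illustrates that assignment rebinds a name rather than changing the object it referred to. The second variable, original, is a name we introduce to keep a reference to the first object.

```python
x = 7
original = x          # a second name bound to the same int object
x = x + 10            # rebinds x to a new object; the object 7 is unchanged
print(x, original)    # 17 7
print(x is original)  # False

# The same name can refer to objects of different types over time.
for value in (7, 4.1, 'dog'):
    x = value
    print(type(x).__name__)  # int, then float, then str
```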


Garbage Collection 


Python creates objects in memory and removes them from memory as necessary. After 





snippet [10], the variable x now refers to a float object. The integer object from 
snippet [7] is no longer bound to a variable. As we'll discuss in a later chapter, Python 
automatically removes such objects from memory. This process—called garbage 


collection—helps ensure that memory is available for new objects you create. 


2.9 INTRO TO DATA SCIENCE: BASIC DESCRIPTIVE 
STATISTICS 


In data science, you'll often use statistics to describe and summarize your data. Here, 


we begin by introducing several such descriptive statistics, including: 


• minimum—the smallest value in a collection of values.
• maximum—the largest value in a collection of values.
• range—the range of values from the minimum to the maximum.
• count—the number of values in a collection.
• sum—the total of the values in a collection.


We'll look at determining the count and sum in the next chapter. Measures of 
dispersion (also called measures of variability), such as range, help determine 
how spread out values are. Other measures of dispersion that we'll present in later 


chapters include variance and standard deviation. 


Determining the Minimum of Three Values 


First, let’s show how to determine the minimum of three values manually. The 





following script prompts for and inputs three values, uses if statements to determine 


the minimum value, then displays it. 


 1  # fig02_02.py
 2  """Find the minimum of three values."""
 3
 4  number1 = int(input('Enter first integer: '))
 5  number2 = int(input('Enter second integer: '))
 6  number3 = int(input('Enter third integer: '))
 7
 8  minimum = number1
 9
10  if number2 < minimum:
11      minimum = number2
12
13  if number3 < minimum:
14      minimum = number3
15
16  print('Minimum value is', minimum)



Enter first integer: 12
Enter second integer: 27
Enter third integer: 36
Minimum value is 12


Enter first integer: 27
Enter second integer: 12
Enter third integer: 36
Minimum value is 12


Enter first integer: 36
Enter second integer: 27
Enter third integer: 12
Minimum value is 12






After inputting the three values, we process one value at a time: 


• First, we assume that number1 contains the smallest value, so line 8 assigns it to
the variable minimum. Of course, it’s possible that number2 or number3 contains
the actual smallest value, so we still must compare each of these with minimum.

• The first if statement (lines 10–11) then tests number2 < minimum and if this
condition is True assigns number2 to minimum.

• The second if statement (lines 13–14) then tests number3 < minimum, and if this
condition is True assigns number3 to minimum.


Now, minimum contains the smallest value, so we display it. We executed the script 
three times to show that it always finds the smallest value regardless of whether the 


user enters it first, second or third. 
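

The same comparison logic can be packaged as a function and checked against the built-in min function. This is a sketch: the function name minimum_of_three is hypothetical, and defining functions is covered in a later chapter.

```python
def minimum_of_three(number1, number2, number3):
    """Return the smallest of three values using two comparisons."""
    minimum = number1
    if number2 < minimum:
        minimum = number2
    if number3 < minimum:
        minimum = number3
    return minimum


# The smallest value is found whether it comes first, second or third.
for values in [(12, 27, 36), (27, 12, 36), (36, 27, 12)]:
    print(minimum_of_three(*values), min(values))  # 12 12 each time
```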


Determining the Minimum and Maximum with Built-In Functions min and 
max 


Python has many built-in functions for performing common tasks. Built-in functions 
min and max calculate the minimum and maximum, respectively, of a collection of 


values: 


In [1]: min(36, 27, 12)
Out[1]: 12

In [2]: max(36, 27, 12)
Out[2]: 36


The functions min and max can receive any number of arguments. 


Determining the Range of a Collection of Values 


The range of values is simply the minimum through the maximum value. In this case, 
the range is 12 through 36. Much data science is devoted to getting to know your data. 
Descriptive statistics is a crucial part of that, but you also have to understand how to 
interpret the statistics. For example, if you have 100 numbers with a range of 12 
through 36, those numbers could be distributed evenly over that range. At the opposite 
extreme, you could have clumping with 99 values of 12 and one 36, or one 12 and 99 


values of 36. 
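

To make this concrete, the sketch below builds three hypothetical collections of 100 values that share the range 12 through 36 but are distributed very differently. It uses lists and the built-in sum and len functions, which are introduced later; the mean is printed only to show how much the collections differ.

```python
# Three made-up collections of 100 values, all ranging from 12 through 36.
evenly_spread = list(range(12, 37)) * 4  # each of 12..36 appears four times
clumped_low = [12] * 99 + [36]
clumped_high = [12] + [36] * 99

for data in (evenly_spread, clumped_low, clumped_high):
    mean = sum(data) / len(data)
    print(len(data), min(data), max(data), mean)
# 100 12 36 24.0
# 100 12 36 12.24
# 100 12 36 35.76
```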


Functional-Style Programming: Reduction 


Throughout this book, we introduce various functional-style programming 
capabilities. These enable you to write code that can be more concise, clearer and easier 


to debug—that is, find and correct errors. The min and max functions are examples of a
functional-style programming concept called reduction. They reduce a collection of
values to a single value. Other reductions you'll see include the sum, average, variance 
and standard deviation of a collection of values. You'll also see how to define custom 


reductions. 
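

As a preview, one possible way to write a custom reduction is the standard library's functools.reduce function, which repeatedly combines two values into one. This sketch is an illustration under that assumption and not necessarily the approach used in later chapters; the names values, product and smallest are our own.

```python
from functools import reduce

values = [36, 27, 12]

# Built-in reductions: each turns the collection into a single value.
print(min(values), max(values), sum(values))  # 12 36 75

# A custom reduction: the product of the values.
product = reduce(lambda x, y: x * y, values)
print(product)  # 11664

# min expressed as a reduction that keeps the smaller of each pair.
smallest = reduce(lambda x, y: x if x < y else y, values)
print(smallest)  # 12
```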


Upcoming Intro to Data Science Sections 


In the next two chapters, we'll continue our discussion of basic descriptive statistics 
with measures of central tendency, including mean, median and mode, and measures 


of dispersion, including variance and standard deviation. 


2.10 WRAP-UP 


This chapter continued our discussion of arithmetic. You used variables to store values 
for later use. We introduced Python’s arithmetic operators and showed that you must 
write all expressions in straight-line form. You used the built-in function print to 
display data. We created single-, double- and triple-quoted strings. You used triple- 
quoted strings to create multiline strings and to embed single or double quotes in 


strings. 


You used the input function to prompt for and get input from the user at the 





keyboard. We used the functions int and float to convert strings to numeric values. 


We presented Python’s comparison operators. Then, you used them in a script that read 





two integers from the user and compared their values using a series of if statements. 


We discussed Python’s dynamic typing and used the built-in function type to display 
an object’s type. Finally, we introduced the basic descriptive statistics minimum and 
maximum and used them to calculate the range of a collection of values. In the next 


chapter, we present Python’s control statements. 


3. Control Statements 


Objectives 


In this chapter, you'll: 





• Make decisions with if, if...else and if...elif...else.

• Execute statements repeatedly with while and for.

• Shorten assignment expressions with augmented assignments.

• Use the for statement and the built-in range function to repeat actions for a
sequence of values.

• Perform sentinel-controlled iteration with while.

• Create compound conditions with the Boolean operators and, or and not.

• Stop looping with break.

• Force the next iteration of a loop with continue.

• Use functional-style programming features to write scripts that are more concise,
clearer, easier to debug and easier to parallelize.


3.1 INTRODUCTION


In this chapter, we present Python’s control statements: if, if...else, if...elif...else,
while, for, break and continue. You'll use the for statement to perform
sequence-controlled iteration; you'll see that the number of items in a sequence of items
determines the for statement’s number of iterations. You'll use the built-in function
range to generate sequences of integers.


We'll show sentinel-controlled iteration with the while statement. You'll use the 
Python Standard Library’s Decimal type for precise monetary calculations. We’ll 
format data in f-strings (that is, format strings) using various format specifiers. We'll 
also show the Boolean operators and, or and not for creating compound conditions. In 
the Intro to Data Science section, we'll consider measures of central tendency—mean, 


median and mode—using the Python Standard Library’s statistics module. 


3.2 CONTROL STATEMENTS


Python provides three selection statements that execute code based on a condition that 


evaluates to either True or False: 





• The if statement performs an action if a condition is True or skips the action if the
condition is False.

• The if...else statement performs an action if a condition is True or performs
a different action if the condition is False.

• The if...elif...else statement performs one of many different actions,
depending on the truth or falsity of several conditions.


Anywhere a single action can be placed, a group of actions can be placed. 





Python provides two iteration statements—while and for: 


• The while statement repeats an action (or a group of actions) as long as a
condition remains True.

• The for statement repeats an action (or a group of actions) for every item in a
sequence of items.


Keywords 





The words if, elif, else, while, for, True and False are Python keywords. Using
a keyword as an identifier, such as a variable name, is a syntax error. The following table
lists Python’s keywords.


Python keywords 





and       as        assert    async     await
break     class     continue  def       del
elif      else      except    False     finally
for       from      global    if        import
in        is        lambda    None      nonlocal
not       or        pass      raise     return
True      try       while     with      yield
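If you are unsure whether a name is reserved, the Python Standard Library's keyword module can check it. The sketch below uses keyword.iskeyword and keyword.kwlist; the candidate names are illustrative assumptions, and the exact keyword list depends on your Python version.

```python
import keyword

# Candidate identifiers to check (illustrative assumptions).
candidates = ['while', 'grade', 'lambda', 'total']

for name in candidates:
    # iskeyword returns True only for reserved words
    print(f'{name}: {keyword.iskeyword(name)}')

# kwlist holds every keyword known to the running interpreter
print(len(keyword.kwlist), 'keywords in this Python version')
```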


3.3 IF STATEMENT 





Let’s execute a Python if statement: 


In [1]: grade = 85

In [2]: if grade >= 60:
   ...:     print('Passed')
   ...:
Passed





The condition grade >= 60 is True, so the indented print statement in the if’s
suite displays 'Passed'.


Suite Indentation 


Indenting a suite is required; otherwise, an IndentationError syntax error occurs: 


In [3]: if grade >= 60:
   ...: print('Passed')  # statement is not indented properly
  File "<ipython-input-3-f42783904220>", line 2
    print('Passed')  # statement is not indented properly
        ^
IndentationError: expected an indented block


An IndentationError also occurs if you have more than one statement in a suite 


and those statements do not have the same indentation: 


In [4]: if grade >= 60:
   ...:     print('Passed')  # indented 4 spaces
   ...:   print('Good job!')  # incorrectly indented only two spaces
  File "<ipython-input-4-8c0d75c127bf>", line 3
    print('Good job!')  # incorrectly indented only two spaces
                                                               ^
IndentationError: unindent does not match any outer indentation level


Sometimes error messages may not be clear. The fact that Python calls attention to the 
line is usually enough for you to figure out what’s wrong. Apply indentation 
conventions uniformly throughout your code—programs that are not uniformly 


indented are hard to read. 


Every Expression Can Be Interpreted as Either True or False 


You can base decisions on any expression. A nonzero value is True. Zero is False: 


In [5]: if 1:
   ...:     print('Nonzero values are true, so this will print')
   ...:
Nonzero values are true, so this will print

In [6]: if 0:
   ...:     print('Zero is false, so this will not print')


Strings containing characters are True and empty strings ('', "" or """""") are
False.
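One way to see how Python interprets a value in a condition is to pass it to the built-in function bool. The following sketch is illustrative; the sample values are assumptions chosen to cover zero, nonzero, empty and nonempty cases.

```python
# Sample values (illustrative assumptions).
samples = [1, -7, 0, 0.0, 'hello', '', ' ']

# bool(value) gives the same True/False result an if statement would use
truthiness = [bool(value) for value in samples]

for value, result in zip(samples, truthiness):
    print(f'{value!r:>8} -> {result}')
```

Note that a string containing only a space is not empty, so it is True.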


Confusing == and = 


Using the equality operator == instead of = in an assignment statement can lead to 


subtle problems. For example, in this session, snippet [1] defined grade with the 


assignment: 


grade = 85 


If instead we accidentally wrote: 


grade == 85 


then grade would be undefined and we'd get a NameError. If grade had been defined 


before the preceding statement, then grade == 85 would simply evaluate to True or 


False, and not perform an assignment. This is a logic error. 


3.4 IF...ELSE AND IF...ELIF...ELSE STATEMENTS





The if else statement executes different suites, based on whether a condition is True 


or False: 


In [1]: grade = 85

In [2]: if grade >= 60:
   ...:     print('Passed')
   ...: else:
   ...:     print('Failed')
   ...:
Passed


The condition above is True, so the if suite displays 'Passed'. Note that when you
press Enter after typing print('Passed'), IPython indents the next line four spaces.
You must delete those four spaces so that the else: suite correctly aligns under the i
in if.





The following code assigns 57 to the variable grade, then shows the if else 
statement again to demonstrate that only the else suite executes when the condition is 


False: 


In [3]: grade = 57

In [4]: if grade >= 60:
   ...:     print('Passed')
   ...: else:
   ...:     print('Failed')
   ...:
Failed

Use the up and down arrow keys to navigate backwards and forwards through the 
current interactive session’s snippets. Pressing Enter re-executes the snippet that’s 


displayed. Let’s set grade to 99, press the up arrow key twice to recall the code from 


snippet [4], then press Enter to re-execute that code as snippet [6]. Every recalled 


snippet that you execute gets a new ID: 


In [5]: grade = 99

In [6]: if grade >= 60:
   ...:     print('Passed')
   ...: else:
   ...:     print('Failed')
   ...:
Passed


Conditional Expressions 





Sometimes the suites in an if else statement assign different values to a variable, 


based on a condition, as in: 


In [7]: grade = 87

In [8]: if grade >= 60:
   ...:     result = 'Passed'
   ...: else:
   ...:     result = 'Failed'
   ...:


We can then print or evaluate that variable: 


In [9]: result 
Out[9]: 'Passed'


You can write statements like snippet [8] using a concise conditional expression: 


In [10]: result = ('Passed' if grade >= 60 else 'Failed')

In [11]: result
Out[11]: 'Passed'


The parentheses are not required, but they make it clear that the statement assigns the 
conditional expression’s value to result. First, Python evaluates the condition grade 
>= 60: 


• If it’s True, snippet [10] assigns to result the value of the expression to the left
of if, namely 'Passed'. The else part does not execute.

• If it’s False, snippet [10] assigns to result the value of the expression to the
right of else, namely 'Failed'.


In interactive mode, you also can evaluate the conditional expression directly, as in: 


In [12]: 'Passed' if grade >= 60 else 'Failed'
Out[12]: 'Passed'


Multiple Statements in a Suite 





The following code shows two statements in the else suite of an if...else
statement:


In [13]: grade = 49

In [14]: if grade >= 60:
    ...:     print('Passed')
    ...: else:
    ...:     print('Failed')
    ...:     print('You must take this course again')
    ...:
Failed
You must take this course again


In this case, grade is less than 60, so both statements in the else’s suite execute.


If you do not indent the second print, then it’s not in the else’s suite. So, that 


statement always executes, possibly creating strange incorrect output: 


In [15]: grade = 100

In [16]: if grade >= 60:
    ...:     print('Passed')
    ...: else:
    ...:     print('Failed')
    ...: print('You must take this course again')
    ...:
Passed
You must take this course again


if...elif...else Statement 


You can test for many cases using the if...elif...else statement. The
following code displays “A” for grades greater than or equal to 90, “B” for grades in the
range 80-89, “C” for grades 70-79, “D” for grades 60-69 and “F” for all other grades.
Only the action for the first True condition executes. Snippet [18] displays C, because
grade is 77:


In [17]: grade = 77

In [18]: if grade >= 90:
    ...:     print('A')
    ...: elif grade >= 80:
    ...:     print('B')
    ...: elif grade >= 70:
    ...:     print('C')
    ...: elif grade >= 60:
    ...:     print('D')
    ...: else:
    ...:     print('F')
    ...:
C


The first condition, grade >= 90, is False, so print('A') is skipped. The second
condition, grade >= 80, also is False, so print('B') is skipped. The third
condition, grade >= 70, is True, so print('C') executes. Then all the remaining
code in the if...elif...else statement is skipped. An if...elif...else is
faster than separate if statements, because condition testing stops as soon as a
condition is True.
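The same decision logic can be packaged for reuse. The sketch below wraps the grade tests in a function; the name letter_grade is a hypothetical helper introduced here for illustration, and functions themselves are covered later in the book.

```python
def letter_grade(grade):
    """Return the letter for a numeric grade (hypothetical helper)."""
    if grade >= 90:
        return 'A'
    elif grade >= 80:  # reached only when grade < 90
        return 'B'
    elif grade >= 70:  # reached only when grade < 80
        return 'C'
    elif grade >= 60:  # reached only when grade < 70
        return 'D'
    else:              # every remaining value
        return 'F'


for sample in [95, 77, 60, 12]:
    print(sample, letter_grade(sample))
```

Because testing stops at the first True condition, each elif can rely on the earlier conditions having been False, so no upper bounds need to be written.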


else Is Optional 








The else in the if...elif...else statement is optional. Including it enables you
to handle values that do not satisfy any of the conditions. When an if...elif
statement without an else tests a value that does not make any of its conditions True,
the program does not execute any of the statement’s suites; the next statement in
sequence after the if...elif statement executes. If you specify the else, you
must place it after the last elif; otherwise, a SyntaxError occurs.


Logic Errors 


The incorrectly indented code segment in snippet [16] is an example of a nonfatal
logic error. The code executes, but it produces incorrect results. For a fatal logic error
in a script, an exception occurs (such as a ZeroDivisionError from an attempt to
divide by 0), so Python displays a traceback, then terminates the script. A fatal error in
interactive mode terminates only the current snippet; then IPython waits for your next
input.


3.5 WHILE STATEMENT 


The while statement allows you to repeat one or more actions while a condition 


remains True. Let’s use a while statement to find the first power of 3 larger than 50: 


In [1]: product = 3

In [2]: while product <= 50:
   ...:     product = product * 3
   ...:

In [3]: product
Out[3]: 81


Snippet [3] evaluates product to see its value, 81, which is the first power of 3 larger 
than 50. 


Something in the while statement’s suite must change product’s value, so the 


condition eventually becomes False. Otherwise, an infinite loop occurs. In 
applications executed from a Terminal, Anaconda Command Prompt or shell, type Ctrl 
+ c or control + c to terminate an infinite loop. IDEs typically have a toolbar button or 


menu option for stopping a program’s execution. 
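A sketch of the same loop in reusable form follows. The function name first_power_above and its parameters are hypothetical, introduced only for illustration. The guard at the top is an added assumption: with a base below 2, product would never exceed the limit and the loop would be infinite.

```python
def first_power_above(base, limit):
    """Return the first power of base larger than limit (hypothetical helper)."""
    if base < 2:
        # with base 0 or 1, product never grows, so the loop would never end
        raise ValueError('base must be at least 2')

    product = base
    while product <= limit:  # becomes False once product passes limit
        product *= base      # the suite changes product on every pass
    return product


print(first_power_above(3, 50))
```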


3.6 FOR STATEMENT 


The for statement allows you to repeat an action or several actions for each item in a 
sequence of items. For example, a string is a sequence of individual characters. Let’s 


display 'Programming' with its characters separated by two spaces: 


In [1]: for character in 'Programming':
   ...:     print(character, end='  ')
   ...:
P  r  o  g  r  a  m  m  i  n  g


The for statement executes as follows: 





• Upon entering the statement, it assigns the 'P' in 'Programming' to the target
variable between keywords for and in (in this case, character).

• Next, the statement in the suite executes, displaying character’s value followed by
two spaces. We'll say more about this momentarily.

• After executing the suite, Python assigns to character the next item in the
sequence (that is, the 'r' in 'Programming'), then executes the suite again.

• This continues while there are more items in the sequence to process. In this case,
the statement terminates after displaying the letter 'g', followed by two spaces.


Using the target variable in the suite, as we did here to display its value, is common but 


not required. 


Function print’s end Keyword Argument 


The built-in function print displays its argument(s), then moves the cursor to the next 


line. You can change this behavior with the argument end, as in 


print(character, end='  ')


which displays character’s value followed by two spaces. So, all the characters 
display horizontally on the same line. Python calls end a keyword argument, but 
end itself is not a Python keyword. Keyword arguments are sometimes called named 
arguments. The end keyword argument is optional. If you do not include it, print 
uses a newline ('\n') by default. The Style Guide for Python Code recommends 


placing no spaces around a keyword argument’s =. 


Function print’s sep Keyword Argument 


You can use the keyword argument sep (short for separator) to specify the string that 
appears between the items that print displays. When you do not specify this 
argument, print uses a space character by default. Let’s display three numbers, each 


separated from the next by a comma and a space, rather than just a space: 


In [2]: print(10, 20, 30, sep=', ')
10, 20, 30


To remove the default spaces, use sep='' (that is, an empty string). 
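To inspect exactly which characters sep and end produce, you can send print's output to an in-memory text stream instead of the screen. This sketch relies on print's file keyword argument and the standard library's io.StringIO, neither of which the text above introduces; the variable names are illustrative.

```python
import io

buffer = io.StringIO()  # in-memory stream standing in for the screen

print(10, 20, 30, sep=', ', file=buffer)  # comma and space between items
print(10, 20, 30, sep='', file=buffer)    # nothing between items
print('no newline', end='', file=buffer)  # suppress the trailing newline

captured = buffer.getvalue()
print(repr(captured))  # repr makes the newline characters visible
```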


3.6.1 Iterables, Lists and Iterators 





The sequence to the right of the for statement’s in keyword must be an iterable, that
is, an object from which the for statement can take one item at a time until no more
items remain. Python has other iterable sequence types besides strings. One of the most
common is a list, which is a comma-separated collection of items enclosed in square
brackets ([ and ]). The following code totals five integers in a list:


In [3]: total = 0

In [4]: for number in [2, -3, 0, 17, 9]:
   ...:     total = total + number
   ...:

In [5]: total
Out[5]: 25





Each sequence has an iterator. The for statement uses the iterator “behind the 
scenes” to get each consecutive item until there are no more to process. The iterator is 
like a bookmark—it always knows where it is in the sequence, so it can return the next 
item when it’s called upon to do so. We cover lists in detail in the “Sequences: Lists and 
Tuples” chapter. There, you'll see that the order of the items in a list matters and that a 


list’s items are mutable (that is, modifiable). 
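The bookmark behavior can be observed directly with the built-in functions iter and next, which are roughly what the for statement uses behind the scenes. This is a minimal sketch; the list contents and variable names are illustrative assumptions.

```python
numbers = [10, 20, 30]    # illustrative data

iterator = iter(numbers)  # ask the list for an iterator

first = next(iterator)    # the iterator returns the first item
second = next(iterator)   # it remembers its position and returns the next

remaining = list(iterator)     # consumes whatever is left
finished = next(iterator, None)  # default value instead of an error at the end

print(first, second, remaining, finished)
```

Once an iterator is exhausted it stays exhausted; the list itself is unchanged and can supply a fresh iterator at any time.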


3.6.2 Built-In range Function 





Let’s use a for statement and the built-in range function to iterate precisely 10 


times, displaying the values from 0 through 9: 


In [6]: for counter in range(10):
   ...:     print(counter, end=' ')
   ...:
0 1 2 3 4 5 6 7 8 9


The function call range(10) creates an iterable object that represents a sequence of
consecutive integers starting from 0 and continuing up to, but not including, the
argument value (10); in this case, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. The for statement exits
when it finishes processing the last integer that range produces. Iterators and iterable
objects are two of Python’s functional-style programming features. We'll introduce
more of these throughout the book.


Off-By-One Errors 


A common type of off-by-one error occurs when you assume that range’s argument 
value is included in the generated sequence. For example, if you provide 9 as range’s 
argument when trying to produce the sequence 0 through 9, range generates only 0 
through 8. 
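A quick way to catch this mistake is to convert the range to a list and look at its last item. The variable names in this sketch are illustrative.

```python
too_short = list(range(9))  # argument 9 is excluded, so this stops at 8
correct = list(range(10))   # argument 10 is excluded, so this stops at 9

print(too_short[-1], len(too_short))
print(correct[-1], len(correct))
```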


3.7 AUGMENTED ASSIGNMENTS 


Augmented assignments abbreviate assignment expressions in which the same 


variable name appears on the left and right of the assignment’s =, as total does in: 


for number in [1, 2, 3, 4, 5]:
    total = total + number

Snippet [2] reimplements this using an addition augmented assignment (+=) statement:

In [1]: total = 0

In [2]: for number in [1, 2, 3, 4, 5]:
   ...:     total += number  # add number to total
   ...:

In [3]: total
Out[3]: 15





The += expression in snippet [2] first adds number’s value to the current total, then 


stores the new value in total. The table below shows sample augmented assignments: 


Augmented    Sample
assignment   expression   Explanation    Assigns

Assume: c = 3, d = 5, e = 4, f = 2, g = 9, h = 12

+=           c += 7       c = c + 7      10 to c
-=           d -= 4       d = d - 4      1 to d
*=           e *= 5       e = e * 5      20 to e
**=          f **= 3      f = f ** 3     8 to f
/=           g /= 2       g = g / 2      4.5 to g
//=          g //= 2      g = g // 2     4 to g
%=           h %= 9       h = h % 9      3 to h
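The sample augmented assignments can be checked by running them. The sketch below assumes the starting values c = 3, d = 5, e = 4, f = 2, g = 9 and h = 12, and uses a second variable, g_floor, which is a hypothetical name added so that true division and floor division each start from 9.

```python
# Assumed starting values for the samples.
c, d, e, f, g, h = 3, 5, 4, 2, 9, 12
g_floor = 9  # hypothetical extra variable for the floor-division sample

c += 7        # c = c + 7
d -= 4        # d = d - 4
e *= 5        # e = e * 5
f **= 3       # f = f ** 3
g /= 2        # g = g / 2, true division always yields a float
g_floor //= 2 # floor division discards the fractional part
h %= 9        # h = h % 9, the remainder

print(c, d, e, f, g, g_floor, h)
```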
3.8 SEQUENCE-CONTROLLED ITERATION; FORMATTED STRINGS


This section and the next solve two class-averaging problems. Consider the following 


requirements statement: 


A class of ten students took a quiz. Their grades (integers in the range 0-100) are 98,
76, 71, 87, 83, 90, 57, 79, 82, 94. Determine the class average on the quiz.


The following script for solving this problem keeps a running total of the grades, 
calculates the average and displays the result. We placed the 10 grades in a list, but you 
could input the grades from a user at the keyboard (as we'll do in the next example) or 
read them from a file (as you'll see how to do in the “Files and Exceptions” chapter). We 
show how to read data from SQL and NoSQL databases in Chapter 16.


 1  # class_average.py
 2  """Class average program with sequence-controlled iteration."""
 3
 4  # initialization phase
 5  total = 0  # sum of grades
 6  grade_counter = 0
 7  grades = [98, 76, 71, 87, 83, 90, 57, 79, 82, 94]  # list of 10 grades
 8
 9  # processing phase
10  for grade in grades:
11      total += grade  # add current grade to the running total
12      grade_counter += 1  # indicate that one more grade was processed
13
14  # termination phase
15  average = total / grade_counter
16  print(f'Class average is {average}')


Class average is 81.7





Lines 5-6 create the variables total and grade_counter and initialize each to 0.
Line 7

grades = [98, 76, 71, 87, 83, 90, 57, 79, 82, 94]  # list of 10 grades

creates the variable grades and initializes it with a list of 10 integer grades.





The for statement processes each grade in the list grades. Line 11 adds the current 
grade to the total. Then, line 12 adds 1 to the variable grade_counter to keep 
track of the number of grades processed so far. Iteration terminates when all 10 grades 


in the list have been processed. The Style Guide for Python Code recommends placing a 





blank line above and below each control statement (as in lines 8 and 13). When the for 
statement terminates, line 15 calculates the average and line 16 displays it. Later in this 
chapter, we use functional-style programming to calculate the average of a list’s items 


more concisely. 


Introduction to Formatted Strings 


Line 16 uses the following simple f-string (short for formatted string) to format this 


script’s result by inserting the value of average into a string: 




f'Class average is {average}' 


The letter f before the string’s opening quote indicates it’s an f-string. You specify





where to insert values by using placeholders delimited by curly braces ({ and }). The 


placeholder 


{average} 


converts the variable average’s value to a string representation, then replaces 


{average} with that replacement text. Replacement-text expressions may contain 
values, variables or other expressions, such as calculations or function calls. In line 16, 
we could have used total / grade_counter in place of average, eliminating the


need for line 15. 
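As a brief illustration of an expression inside a placeholder, the sketch below computes the average within the f-string itself. The values 817 and 10 are the sum and count of the ten grades listed in the requirements statement; the variable name message is an illustrative assumption.

```python
total = 817         # sum of the ten quiz grades
grade_counter = 10  # number of grades

# The placeholder contains a calculation rather than a single variable.
message = f'Class average is {total / grade_counter}'
print(message)
```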


3.9 SENTINEL-CONTROLLED ITERATION 


Let’s generalize the class-average problem. Consider the following requirements 


statement: 


Develop a class-averaging program that processes an arbitrary number of grades 


each time the program executes. 


The requirements statement does not state what the grades are or how many there are, 
so we're going to have the user enter the grades. The program processes an arbitrary
number of grades. The user enters grades one at a time until all the grades have been 
entered, then enters a sentinel value (also called a signal value, a dummy value or a 


flag value) to indicate that there are no more grades. 


Implementing Sentinel-Controlled Iteration


The following script solves the class average problem with sentinel-controlled iteration. 
Notice that we test for the possibility of division by zero. If undetected, this would cause 
a fatal logic error. In the “Files and Exceptions” chapter, we write programs that 


recognize such exceptions and take appropriate actions. 


 1  # class_average_sentinel.py
 2  """Class average program with sentinel-controlled iteration."""
 3
 4  # initialization phase
 5  total = 0  # sum of grades
 6  grade_counter = 0  # number of grades entered
 7
 8  # processing phase
 9  grade = int(input('Enter grade, -1 to end: '))  # get one grade
10
11  while grade != -1:
12      total += grade
13      grade_counter += 1
14      grade = int(input('Enter grade, -1 to end: '))
15
16  # termination phase
17  if grade_counter != 0:
18      average = total / grade_counter
19      print(f'Class average is {average:.2f}')
20  else:
21      print('No grades were entered')


Enter grade, -1 to end: 97
Enter grade, -1 to end: 88
Enter grade, -1 to end: 72
Enter grade, -1 to end: -1
Class average is 85.67





Program Logic for Sentinel-Controlled Iteration 


In sentinel-controlled iteration, the program reads the first value (line 9) before
reaching the while statement. The value input in line 9 determines whether the
program’s flow of control should enter the while’s suite (lines 12-14). If the condition
in line 11 is False, the user entered the sentinel value (-1), so the suite does not
execute because the user did not enter any grades. If the condition is True, the suite
executes, adding the grade value to the total and incrementing the
grade_counter.


Next, line 14 inputs another grade from the user and the condition (line 11) is tested 
again, using the most recent grade entered by the user. The value of grade is always 
input immediately before the program tests the while condition, so we can determine 


whether the value just input is the sentinel before processing that value as a grade. 


When the sentinel value is input, the loop terminates, and the program does not add -1
to total. In a sentinel-controlled loop that performs user input, any prompts (lines 9 


and 14) should remind the user of the sentinel value. 
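The read-then-test pattern described above can be exercised without a keyboard by feeding the loop from a list. The function class_average and its parameters are hypothetical names for this sketch. It also assumes that running out of entries should be treated the same as reading the sentinel, which the original script does not need to handle.

```python
def class_average(entries, sentinel=-1):
    """Average the values that precede the sentinel (hypothetical helper)."""
    total = 0
    grade_counter = 0
    values = iter(entries)

    grade = next(values, sentinel)      # read the first value before the loop

    while grade != sentinel:
        total += grade
        grade_counter += 1
        grade = next(values, sentinel)  # read the next value, then retest

    if grade_counter != 0:
        return total / grade_counter
    return None                         # no grades were entered


print(class_average([97, 88, 72, -1]))
print(class_average([-1]))
```

Returning None when no grades are entered plays the role of the division-by-zero test in the script.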


Formatting the Class Average with Two Decimal Places 


This example formatted the class average with two digits to the right of the decimal 
point. In an f-string, you can optionally follow a replacement-text expression with a 


colon (:) and a format specifier that describes how to format the replacement text. 








The format specifier .2f (line 19) formats the average as a floating-point number (f)
with two digits to the right of the decimal point (.2). In this example, the sum of the
grades was 257, which, when divided by 3, yields 85.666666666.... Formatting the
average with .2f rounds it to the hundredths position, producing the replacement
text 85.67. An average with only one digit to the right of the decimal point would be
formatted with a trailing zero (e.g., 85.50). The chapter “Strings: A Deeper Look”
discusses many more string-formatting features.
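The two cases described above, rounding and the trailing zero, can be checked with a short sketch. The variable names are illustrative assumptions.

```python
average = 257 / 3  # 85.666..., the class average from the example

rounded = f'{average:.2f}'  # rounds to the hundredths position
padded = f'{85.5:.2f}'      # one decimal digit gains a trailing zero

print(rounded)
print(padded)
```

The format specifier affects only the text that is displayed; the value stored in average is unchanged.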


3.10 BUILT-IN FUNCTION RANGE: A DEEPER LOOK 


Function range also has two- and three-argument versions. As you've seen, range’s 
one-argument version produces a sequence of consecutive integers from 0 up to, but 
not including, the argument’s value. Function range’s two-argument version produces 
a sequence of consecutive integers from its first argument’s value up to, but not 


including, the second argument’s value, as in: 


In [1]: for number in range(5, 10):
   ...:     print(number, end=' ')
   ...:
5 6 7 8 9

Function range’s three-argument version produces a sequence of integers from its first 


argument’s value up to, but not including, the second argument’s value, incrementing 


by the third argument’s value, which is known as the step: 


In [2]: for number in range(0, 10, 2):
   ...:     print(number, end=' ')
   ...:
0 2 4 6 8

If the third argument is negative, the sequence progresses from the first argument’s 


value down to, but not including the second argument’s value, decrementing by the 


third argument’s value, as in: 


In [3]: for number in range(10, 0, -2):
   ...:     print(number, end=' ')
   ...:
10 8 6 4 2
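Converting each range to a list shows all three forms side by side and makes the excluded end value easy to see. The variable names in this sketch are illustrative.

```python
two_args = list(range(5, 10))         # start, stop
with_step = list(range(0, 10, 2))     # start, stop, positive step
counting_down = list(range(10, 0, -2))  # start, stop, negative step

print(two_args)
print(with_step)
print(counting_down)
```

In every case the second argument itself never appears in the sequence.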


3.11 USING TYPE DECIMAL FOR MONETARY AMOUNTS 


In this section, we introduce Decimal capabilities for precise monetary calculations. If 
you're in banking or other fields that require “to-the-penny” accuracy, you should 


investigate Decimal’s capabilities in depth. 


For most scientific and other mathematical applications that use numbers with decimal 
points, Python’s built-in floating-point numbers work well. For example, when we 
speak of a “normal” body temperature of 98.6, we do not need to be precise to a large 
number of digits. When we view the temperature on a thermometer and read it as 98.6, 
the actual value may be 98.5999473210643. The point here is that calling this number 


98.6 is adequate for most body-temperature applications. 


Floating-point values are stored in binary format (we introduced binary in the first 
chapter and discuss it in depth in the online “Number Systems” appendix). Some 
floating-point values are represented only approximately when they’re converted to 
binary. For example, consider the variable amount with the dollars-and-cents value 


112.31. If you display amount, it appears to have the exact value you assigned to it: 


In [1]: amount = 112.31

In [2]: print(amount)
112.31


However, if you print amount with 20 digits of precision to the right of the decimal 
point, you can see that the actual floating-point value in memory is not exactly 112.31 


—it’s only an approximation: 


In [3]: print(f'{amount:.20f}')
112.31000000000000227374


Many applications require precise representation of numbers with decimal points. 
Institutions like banks that deal with millions or even billions of transactions per day 
have to tie out their transactions “to the penny.” Floating-point numbers can represent 


some but not all monetary amounts with to-the-penny precision. 
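A small sketch shows how these approximations surface in ordinary arithmetic. The amounts are illustrative assumptions: adding ten dimes stored as floating-point values does not produce exactly one dollar.

```python
dimes = [0.10] * 10       # ten dimes as floating-point values (illustrative)
ten_dimes = sum(dimes)

print(ten_dimes == 1.0)      # comparison with exactly one dollar
print(f'{ten_dimes:.20f}')   # reveal the stored approximation

print(0.1 + 0.2 == 0.3)      # another well-known case
```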


The Python Standard Library [1] provides many predefined capabilities you can use
in your Python code to avoid “reinventing the wheel.” For monetary calculations and
other applications that require precise representation and manipulation of numbers
with decimal points, the Python Standard Library provides type Decimal, which uses a
special coding scheme to solve the problem of to-the-penny precision. That scheme
requires additional memory to hold the numbers and additional processing time to
perform calculations but provides the precision required for monetary calculations.
Banks also have to deal with other issues such as using a fair rounding algorithm when
they're calculating daily interest on accounts. Type Decimal offers such capabilities. [2]

[1] https://docs.python.org/3.7/library/index.html.

[2] For more decimal module features, visit
https://docs.python.org/3.7/library/decimal.html.


Importing Type Decimal from the decimal Module 


We've used several built-in types—int (for integers, like 10), float (for floating-point 
numbers, like 7.5) and str (for strings like 'Python'). The Decimal type is not built 
into Python. Rather, it’s part of the Python Standard Library, which is divided into 
groups of related capabilities called modules. The decimal module defines type 


Decimal and its capabilities. 


To use type Decimal, you must first import the entire decimal module, as in 


import decimal 


and refer to the Decimal type as decimal.Decimal, or you must indicate a specific
capability to import using from...import, as we do here:




In [4]: from decimal import Decimal 


This imports only the type Decimal from the decimal module so that you can use it 


in your code. We'll discuss other import forms beginning in the next chapter. 


Creating Decimals 


You typically create a Decimal from a string: 


In [5]: principal = Decimal('1000.00')

In [6]: principal
Out[6]: Decimal('1000.00')

In [7]: rate = Decimal('0.05')

In [8]: rate
Out[8]: Decimal('0.05')


We'll soon use these variables principal and rate in a compound-interest 


calculation. 


Decimal Arithmetic 


Decimals support the standard arithmetic operators +, -, *, /, //, ** and %, as well as 


the corresponding augmented assignments: 


In [9]: x = Decimal('10.5')

In [10]: y = Decimal('2')

In [11]: x + y
Out[11]: Decimal('12.5')

In [12]: x // y
Out[12]: Decimal('5')

In [13]: x += y

In [14]: x
Out[14]: Decimal('12.5')


You may perform arithmetic between Decimals and integers, but not between 


Decimals and floating-point numbers. 
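The sketch below illustrates both halves of that rule, and also why a Decimal is typically created from a string. It uses try and except, which this chapter has not covered, only to capture the error instead of ending the program; the variable names are illustrative assumptions.

```python
from decimal import Decimal

x = Decimal('10.5')

with_int = x + 2  # mixing a Decimal with an int is allowed

try:
    x + 2.0       # mixing a Decimal with a float is not
    mixed_failed = False
except TypeError:
    mixed_failed = True

# A Decimal built from a float inherits the float's approximation.
from_float = Decimal(0.1)
from_string = Decimal('0.1')

print(with_int, mixed_failed, from_float == from_string)
```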


Compound-Interest Problem Requirements Statement 


Let’s compute compound interest using the Decimal type for precise monetary 


calculations. Consider the following requirements statement: 


A person invests $1000 in a savings account yielding 5% interest. Assuming that the 
person leaves all interest on deposit in the account, calculate and display the amount 
of money in the account at the end of each year for 10 years. Use the following 


formula for determining these amounts: 


a = p(1 + r)^n

where

p is the original amount invested (i.e., the principal),
r is the annual interest rate,
n is the number of years and
a is the amount on deposit at the end of the nth year.


Calculating Compound Interest 


To solve this problem, let’s use variables principal and rate that we defined in
snippets [5] and [7], and a for statement that performs the interest calculation for
each of the 10 years the money remains on deposit. For each year, the loop displays a
formatted string containing the year number and the amount on deposit at the end of
that year:


In [15]: for year in range(1, 11):
    ...:     amount = principal * (1 + rate) ** year
    ...:     print(f'{year:>2}{amount:>10.2f}')
    ...:
 1   1050.00
 2   1102.50
 3   1157.62
 4   1215.51
 5   1276.28
 6   1340.10
 7   1407.10
 8   1477.46
 9   1551.33
10   1628.89

The algebraic expression (1 + r)^n from the requirements statement is written as

(1 + rate) ** year

where variable rate represents r and variable year represents n.


Formatting the Year and Amount on Deposit 


The statement 


print(f'{year:>2}{amount:>10.2f}')


uses an f-string with two placeholders to format the loop’s output. 


The placeholder 


{year:>2} 


uses the format specifier >2 to indicate that year’s value should be right aligned (>) 
in a field of width 2—the field width specifies the number of character positions to use 
when displaying the value. For the single-digit year values 1—9, the format specifier 

>2 displays a space character followed by the value, thus right aligning the years in the 
first column. The following diagram shows the numbers 1 and 10 each formatted in a 
field width of 2: 


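[Diagram: the values 1 and 10, each in a field of width 2. The value 1 is preceded by one leading space; the value 10 fills both character positions.]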


You can left align values with <. 





The format specifier >10.2f in the placeholder

{amount:>10.2f}

formats amount as a floating-point number (f) right aligned (>) in a field width of 10
with a decimal point and two digits to the right of the decimal point (.2). Formatting
the amounts this way aligns their decimal points vertically, as is typical with monetary
amounts. In the 10 character positions, the three rightmost characters are the number’s
decimal point followed by the two digits to its right. The remaining seven character
positions are the leading spaces and the digits to the decimal point’s left. In this
example, all the dollar amounts have four digits to the left of the decimal point, so each
number is formatted with three leading spaces. The following diagram shows the
formatting for the value 1050.00:


[Diagram: the value 1050.00 in a field of width 10: three leading spaces, the digits 1050, the decimal point, and two digits to the right of the decimal point.]
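To see the field widths and alignment concretely, here is a small sketch that formats one row both right aligned and left aligned. Displaying each result with repr makes the padding spaces visible; the trailing '|' is our own marker for the end of the left-aligned field:

```python
from decimal import Decimal

amount = Decimal('1050.00')

right = f'{1:>2}{amount:>10.2f}'    # right align both fields
left = f'{1:<2}{amount:<10.2f}|'    # left align; '|' marks the field's end

print(repr(right))
print(repr(left))
```

In the right-aligned version, the padding comes before each value; in the left-aligned version, it comes after.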


3.12 BREAK AND CONTINUE STATEMENTS 


The break and continue statements alter a loop’s flow of control. Executing a break 
statement in a while or for immediately exits that statement. In the following code, 
range produces the integer sequence 0–99, but the loop terminates when number is
10:


In [1]: for number in range(100):
   ...:     if number == 10:
   ...:         break
   ...:     print(number, end=' ')
   ...:
0 1 2 3 4 5 6 7 8 9


In a script, execution would continue with the next statement after the for loop. The
while and for statements each have an optional else clause that executes only if the
loop terminates normally, that is, not as a result of a break.
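A short sketch of the loop else clause follows. The lists and the variable names outcome and second_outcome are ours, purely for illustration. The first loop finishes without a break, so its else clause runs; the second loop exits via break, so its else clause is skipped:

```python
for number in [3, 5, 8, 11]:
    if number % 7 == 0:
        outcome = f'found {number}'
        break                          # leaving via break skips the else
else:
    outcome = 'no multiple of 7'       # runs: the loop ended normally

for number in [3, 14, 8]:
    if number % 7 == 0:
        second_outcome = f'found {number}'
        break
else:
    second_outcome = 'no multiple of 7'   # skipped: break executed

print(outcome)
print(second_outcome)
```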


Executing a continue statement in a while or for loop skips the remainder of the 





loop’s suite. In a while, the condition is then tested to determine whether the loop 
should continue executing. In a for, the loop processes the next item in the sequence 


(if any): 


In [2]: for number in range(10):
   ...:     if number == 5:
   ...:         continue
   ...:     print(number, end=' ')
   ...:
0 1 2 3 4 6 7 8 9


3.13 BOOLEAN OPERATORS AND, OR AND NOT 


The conditional operators >, <, >=, <=, == and != can be used to form simple 
conditions such as grade >= 60. To form more complex conditions that combine 


simple conditions, use the and, or and not Boolean operators. 


Boolean Operator and 


To ensure that two conditions are both True before executing a control statement’s 
suite, use the Boolean and operator to combine the conditions. The following code 
defines two variables, then tests a condition that’s True if and only if both simple 
conditions are True—if either (or both) of the simple conditions is False, the entire 


and expression is False: 


In [1]: gender = 'Female'

In [2]: age = 70

In [3]: if gender == 'Female' and age >= 65:
   ...:     print('Senior female')
   ...:
Senior female





The if statement has two simple conditions: 


• gender == 'Female' determines whether a person is a female and

• age >= 65 determines whether that person is a senior citizen.


The simple condition to the left of the and operator evaluates first because == has 
higher precedence than and. If necessary, the simple condition to the right of and 
evaluates next, because >= has higher precedence than and. (We'll discuss shortly why 


the right side of an and operator evaluates only if the left side is True.) The entire if 





statement condition is True if and only if both of the simple conditions are True. The 


combined condition can be made clearer by adding redundant parentheses:


(gender == 'Female') and (age >= 65) 


The table below summarizes the and operator by showing all four possible 


combinations of False and True values for expression1 and expression2—such tables 


are called truth tables: 


expression1    expression2    expression1 and expression2

False          False          False
False          True           False
True           False          False
True           True           True


Boolean Operator or 


Use the Boolean or operator to test whether one or both of two conditions are True. 
The following code tests a condition that’s True if either or both simple conditions are 


True—the entire condition is False only if both simple conditions are False: 


In [4]: semester_average = 83

In [5]: final_exam = 95

In [6]: if semester_average >= 90 or final_exam >= 90:
   ...:     print('Student gets an A')
   ...:
Student gets an A


Snippet [6] also contains two simple conditions: 


• semester_average >= 90 determines whether a student’s average was an A (90
or above) during the semester, and

• final_exam >= 90 determines whether a student’s final-exam grade was an A.


The truth table below summarizes the Boolean or operator. Operator and has higher 


precedence than or. 


expression1    expression2    expression1 or expression2

False          False          False
False          True           True
True           False          True
True           True           True


Improving Performance with Short-Circuit Evaluation 


Python stops evaluating an and expression as soon as it knows whether the entire 
condition is False. Similarly, Python stops evaluating an or expression as soon as it 
knows whether the entire condition is True. This is called short-circuit evaluation. So 


the condition 




gender == 'Female' and age >= 65 





stops evaluating immediately if gender is not equal to 'Female' because the entire 
expression must be False. If gender is equal to 'Female', execution continues, 


because the entire expression will be True if the age is greater than or equal to 65. 


Similarly, the condition 


semester_average >= 90 or final_exam >= 90

stops evaluating immediately if semester_average is greater than or equal to 90
because the entire expression must be True. If semester_average is less than 90,
execution continues, because the expression could still be True if the final_exam is
greater than or equal to 90.


In expressions that use and, make the condition that’s more likely to be False the 
leftmost condition. In or operator expressions, make the condition that’s more likely to 
be True the leftmost condition. These techniques can reduce a program’s execution 


time. 
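You can observe short-circuit evaluation directly with a small sketch. The helper function check and the list calls are hypothetical names of ours; check simply records that an operand was evaluated before returning its value:

```python
calls = []   # records which operands were actually evaluated

def check(name, value):
    """Record that an operand was evaluated, then return its value."""
    calls.append(name)
    return value

result_and = check('left', False) and check('right', True)
print(result_and, calls)    # only 'left' was evaluated

calls.clear()
result_or = check('left', True) or check('right', False)
print(result_or, calls)     # again, only 'left' was evaluated
```

In both expressions, the right operand is never evaluated because the left operand already determines the result.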


Boolean Operator not 


The Boolean operator not “reverses” the meaning of a condition: True becomes
False and False becomes True. This is a unary operator, meaning it has only one operand.
You place the not operator before a condition to choose a path of execution if the
original condition (without the not operator) is False, such as in the following code:


In [7]: grade = 87

In [8]: if not grade == -1:
   ...:     print('The next grade is', grade)
   ...:
The next grade is 87


Often, you can avoid using not by expressing the condition in a more “natural” or 





convenient manner. For example, the preceding if statement can also be written as 


follows: 


In [9]: if grade != -1:
   ...:     print('The next grade is', grade)
   ...:
The next grade is 87

The truth table below summarizes the not operator.





expression    not expression

False         True
True          False


The following table shows the precedence and grouping of the operators introduced so 


far, from top to bottom, in decreasing order of precedence. 


Operators                Grouping

()                       left to right
**                       right to left
*  /  //  %              left to right
+  -                     left to right
<  <=  >  >=  ==  !=     left to right
not                      left to right
and                      left to right
or                       left to right
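A few lines of code can confirm several rows of the table. This is only a sketch; the variables a, b and c are arbitrary Boolean values of our choosing:

```python
a, b, c = True, False, False

print(a or b and c)      # and binds more tightly: a or (b and c)
print((a or b) and c)    # parentheses change the grouping
print(not a == b)        # == binds more tightly than not: not (a == b)
print(2 ** 3 ** 2)       # ** groups right to left: 2 ** (3 ** 2)
```

When an expression's grouping is not obvious at a glance, redundant parentheses make the intent explicit.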


3.14 INTRO TO DATA SCIENCE: MEASURES OF 
CENTRAL TENDENCY—MEAN, MEDIAN AND MODE 


Here we continue our discussion of using statistics to analyze data with several 


additional descriptive statistics, including: 


• mean: the average value in a set of values.
• median: the middle value when all the values are arranged in sorted order.
• mode: the most frequently occurring value.


These are measures of central tendency—each is a way of producing a single value 
that represents a “central” value in a set of values, i.e., a value which is in some sense 
typical of the others. 


Let’s calculate the mean, median and mode on a list of integers. The following session
creates a list called grades, then uses the built-in sum and len functions to calculate
the mean “by hand”: sum calculates the total of the grades (397) and len returns the
number of grades (5):


In [1]: grades = [85, 93, 45, 89, 85]

In [2]: sum(grades) / len(grades)
Out[2]: 79.4


The previous chapter mentioned the descriptive statistics count and sum— 
implemented in Python as the built-in functions len and sum. Like functions min and 
max (introduced in the preceding chapter), sum and len are both examples of 
functional-style programming reductions—they reduce a collection of values to a 


single value: the sum of those values and the number of values, respectively. In Section
3.8’s class-average example, we could have deleted lines 10–15 of the script and
replaced average in line 16 with snippet [2]’s calculation.


The Python Standard Library’s statistics module provides functions for 
calculating the mean, median and mode—these, too, are reductions. To use these 


capabilities, first import the statistics module: 


In [3]: import statistics


Then, you can access the module’s functions with “statistics.” followed by the 
name of the function to call. The following calculates the grades list’s mean, median 


and mode, using the statistics module’s mean, median and mode functions: 


In [4]: statistics.mean(grades)
Out[4]: 79.4

In [5]: statistics.median(grades)
Out[5]: 85

In [6]: statistics.mode(grades)
Out[6]: 85


Each function’s argument must be an iterable—in this case, the list grades. To confirm 
that the median and mode are correct, you can use the built-in sorted function to get 


a copy of grades with its values arranged in increasing order: 


In [7]: sorted(grades)
Out[7]: [45, 85, 85, 89, 93]


The grades list has an odd number of values (5), so median returns the middle value
(85). If the list’s number of values is even, median returns the average of the two
middle values. Studying the sorted values, you can see that 85 is the mode because it
occurs most frequently (twice). In Python versions before 3.8, the mode function causes
a StatisticsError for lists like

[85, 93, 45, 89, 85, 93]

in which there are two or more “most frequent” values; from Python 3.8 on, mode instead
returns the first such value it encounters. Such a set of values is said to be
bimodal. Here, both 85 and 93 occur twice.
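As a brief sketch of both points, the following uses the six-value list shown above. It assumes Python 3.8 or later, where the statistics module also provides multimode, a function that returns every most-frequent value:

```python
import statistics

grades = [85, 93, 45, 89, 85, 93]

# even number of values: median averages the two middle values (85 and 89)
print(statistics.median(grades))

# multimode returns all of the most frequent values, in first-seen order
print(statistics.multimode(grades))
```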


3.15 WRAP-UP 


In this chapter, we discussed Python’s control statements, including if, if...else,
if...elif...else, while, for, break and continue. You saw that the for
statement performs sequence-controlled iteration: it processes each item in an
iterable, such as a range of integers, a string or a list. You used the built-in function
range to generate sequences of integers from 0 up to, but not including, its argument,
and to determine how many times a for statement iterates.


You used sentinel-controlled iteration with the while statement to create a loop that 
continues executing until a sentinel value is encountered. You used built-in function 
range’s two-argument version to generate sequences of integers from the first 
argument’s value up to, but not including, the second argument’s value. You also used 
the three-argument version in which the third argument indicated the step between 


integers in a range. 


We introduced the Decimal type for precise monetary calculations and used it to 
calculate compound interest. You used f-strings and various format specifiers to create 
formatted output. We introduced the break and continue statements for altering the 
flow of control in loops. We discussed the Boolean operators and, or and not for 


creating conditions that combine simple conditions. 


Finally, we continued our discussion of descriptive statistics by introducing measures of 
central tendency—mean, median and mode—and calculating them with functions from 


the Python Standard Library’s statistics module. 


In the next chapter, you’ll create custom functions and use existing functions from 
Python’s math and random modules. We show several predefined functional- 
programming reductions and you'll see additional functional-programming 


capabilities. 




4. Functions


Objectives 
In this chapter, you'll 
• Create custom functions.
• Import and use Python Standard Library modules, such as random and math, to
reuse code and avoid “reinventing the wheel.”
• Pass data between functions.
• Generate a range of random numbers.
• See simulation techniques using random-number generation.
• Seed the random number generator to ensure reproducibility.
• Pack values into a tuple and unpack values from a tuple.
• Return multiple values from a function via a tuple.
• Understand how an identifier’s scope determines where in your program you can use
it.
• Create functions with default parameter values.
• Call functions with keyword arguments.
• Create functions that can receive any number of arguments.
• Use methods of an object.
• Write and use a recursive function.

4.1 INTRODUCTION


In this chapter, we continue our discussion of Python fundamentals with custom 
functions and related topics. We'll use the Python Standard Library’s random module 
and random-number generation to simulate rolling a six-sided die. We’ll combine 
custom functions and random-number generation in a script that implements the dice 
game craps. In that example, we'll also introduce Python’s tuple sequence type and use 
tuples to return more than one value from a function. We’ll discuss seeding the random 


number generator to ensure reproducibility. 


You'll import the Python Standard Library’s math module, then use it to learn about 
IPython tab completion, which speeds your coding and discovery processes. You'll 
create functions with default parameter values, call functions with keyword arguments 
and define functions with arbitrary argument lists. We’ll demonstrate calling methods 
of objects. We’ll also discuss how an identifier’s scope determines where in your 


program you can use it. 


We'll take a deeper look at importing modules. You'll see that arguments are passed-by- 
reference to functions. We'll also demonstrate a recursive function and begin 


presenting Python’s functional-style programming capabilities. 


In the Intro to Data Science section, we'll continue our discussion of descriptive 
statistics by introducing measures of dispersion—variance and standard deviation—and 
calculating them with functions from the Python Standard Library’s statistics 


module. 


4.2 DEFINING FUNCTIONS 


You've called many built-in functions (int, float, print, input, type, sum, len, 





min and max) and a few functions from the statistics module (mean, median and 
mode). Each performed a single, well-defined task. You'll often define and call custom 
functions. The following session defines a square function that calculates the square of 


its argument. Then it calls the function twice—once to square the int value 7 





(producing the int value 49) and once to square the float value 2.5 (producing the 


float value 6.25): 





In [1]: def square(number):
   ...:     """Calculate the square of number."""
   ...:     return number ** 2
   ...:

In [2]: square(7)
Out[2]: 49

In [3]: square(2.5)
Out[3]: 6.25


The statements defining the function in the first snippet are written only once, but may 
be called “to do their job” from many points throughout a program and as often as you 
like. Calling square with a non-numeric argument like 'hello' causes a TypeError 


because the exponentiation operator (**) works only with numeric values. 


Defining a Custom Function 


A function definition (like square in snippet [1]) begins with the def keyword,
followed by the function name (square), a set of parentheses and a colon (:). Like
variable identifiers, by convention function names should begin with a lowercase letter
and in multiword names underscores should separate each word.


The required parentheses contain the function’s parameter list—a comma-separated 
list of parameters representing the data that the function needs to perform its task. 
Function square has only one parameter named number—the value to be squared. If 


the parentheses are empty, the function does not use parameters to perform its task. 


The indented lines after the colon (:) are the function’s block, which consists of an
optional docstring followed by the statements that perform the function’s task. We'll 


soon point out the difference between a function’s block and a control statement’s suite. 


Specifying a Custom Function’s Docstring 


The Style Guide for Python Code says that the first line in a function’s block should be a
docstring that briefly explains the function’s purpose:

"""Calculate the square of number."""


To provide more detail, you can use a multiline docstring—the style guide recommends 


starting with a brief explanation, followed by a blank line and the additional details. 
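A multiline docstring for square might look like the following sketch. The wording of the additional details is ours, shown only to illustrate the layout the style guide describes:

```python
def square(number):
    """Calculate the square of number.

    The argument may be any value that supports the ** operator,
    such as an int or a float.
    """
    return number ** 2

print(square.__doc__.splitlines()[0])   # the brief first line
print(square(7))
```

The docstring is stored in the function's __doc__ attribute, which is what help tools display.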


Returning a Result to a Function’s Caller 


When a function finishes executing, it returns control to its caller—that is, the line of 


code that called the function. In square’s block, the return statement: 


return number ** 2 


first squares number, then terminates the function and gives the result back to the
caller. In this example, the first caller is in snippet [2], so IPython displays the result
in Out[2]. The second caller is in snippet [3], so IPython displays the result in
Out[3].


Function calls also can be embedded in expressions. The following code calls square 


first, then print displays the result: 


In [4]: print('The square of 7 is', square(7))
The square of 7 is 49


There are two other ways to return control from a function to its caller: 


• Executing a return statement without an expression terminates the function and
implicitly returns the value None to the caller. The Python documentation states
that None represents the absence of a value. None evaluates to False in conditions.

• When there’s no return statement in a function, it implicitly returns the value
None after executing the last statement in the function’s block.
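Both cases can be sketched with two small functions. The names report and first_positive are hypothetical, chosen for this illustration only:

```python
def report(value):
    """Display value; the block contains no return statement."""
    print('value is', value)

def first_positive(numbers):
    """Return the first positive number, or None via a bare return."""
    for number in numbers:
        if number > 0:
            return number
    return   # no expression, so the caller receives None

result1 = report(5)
result2 = first_positive([-2, -1])
print(result1 is None, result2 is None)

if not result2:   # None is treated as False in a condition
    print('no positive value')
```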


Local Variables 


Though we did not define variables in square’s block, it is possible to do so. A 
function’s parameters and variables defined in its block are all local variables—they 
can be used only inside the function and exist only while the function is executing. 
Trying to access a local variable outside its function’s block causes a NameError, 


indicating that the variable is not defined. 
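The following sketch shows a local variable going out of existence. The function cube and its local variable result are our own illustrative names:

```python
def cube(number):
    """Return number cubed, using a local variable for the result."""
    result = number ** 3   # result exists only while cube executes
    return result

print(cube(3))

try:
    print(result)   # result is not defined outside cube's block
except NameError as error:
    print('NameError:', error)
```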


Accessing a Function’s Docstring via IPython’s Help Mechanism 


IPython can help you learn about the modules and functions you intend to use in your 
code, as well as IPython itself. For example, to view a function’s docstring to learn how 


to use the function, type the function’s name followed by a question mark (?): 


In [5]: square?
Signature: square(number)
Docstring: Calculate the square of number.
File:      ~/Documents/examples/ch04/<ipython-input-1-7268c8ff93a9>
Type:      function


For our square function, the information displayed includes: 


• The function’s name and parameter list, known as its signature.
• The function’s docstring.
• The name of the file containing the function’s definition. For a function in an
interactive session, this line shows information for the snippet that defined the
function: the 1 in "<ipython-input-1-7268c8ff93a9>" means snippet [1].
• The type of the item for which you accessed IPython’s help mechanism, in this case,
a function.

If the function’s source code is accessible from IPython (such as a function defined in
the current session or imported into the session from a .py file), you can use ?? to
display the function’s full source-code definition:


In [6]: square??
Signature: square(number)
Source:
def square(number):
    """Calculate the square of number."""
    return number ** 2
File:      ~/Documents/examples/ch04/<ipython-input-1-7268c8ff93a9>
Type:      function


If the source code is not accessible from IPython, ?? simply shows the docstring. 


If the docstring fits in the window, IPython displays the next In [] prompt. If a
docstring is too long to fit, IPython indicates that there’s more by displaying a colon (:)
at the bottom of the window; press the Space key to display the next screen. You can
navigate backwards and forwards through the docstring with the up and down arrow
keys, respectively. IPython displays (END) at the end of the docstring. Press q (for
“quit”) at any : or the (END) prompt to return to the next In [] prompt. To get a
sense of IPython’s features, type ? at any In [] prompt, press Enter, then read the
help documentation overview.


4.3 FUNCTIONS WITH MULTIPLE PARAMETERS 


Let’s define a maximum function that determines and returns the largest of three values 
—the following session calls the function three times with integers, floating-point 


numbers and strings, respectively. 


In [1]: def maximum(value1, value2, value3):
   ...:     """Return the maximum of three values."""
   ...:     max_value = value1
   ...:     if value2 > max_value:
   ...:         max_value = value2
   ...:     if value3 > max_value:
   ...:         max_value = value3
   ...:     return max_value
   ...:

In [2]: maximum(12, 27, 36)
Out[2]: 36

In [3]: maximum(12.3, 45.6, 9.7)
Out[3]: 45.6

In [4]: maximum('yellow', 'red', 'orange')
Out[4]: 'yellow'








We did not place blank lines above and below the if statements, because pressing 


return on a blank line in interactive mode completes the function’s definition. 








You also may call maximum with mixed types, such as ints and floats: 


In [5]: maximum(13.5, -3, 7)
Out[5]: 13.5

The call maximum(13.5, 'hello', 7) results in TypeError because strings and
numbers cannot be compared to one another with the greater-than (>) operator.
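Here is a minimal sketch of that failure, repeating the maximum definition so the example is self-contained. Catching the exception with try...except is our addition, used here only so the script can display the error and continue:

```python
def maximum(value1, value2, value3):
    """Return the maximum of three values."""
    max_value = value1
    if value2 > max_value:
        max_value = value2
    if value3 > max_value:
        max_value = value3
    return max_value

print(maximum(13.5, -3, 7))   # ints and floats compare without trouble

try:
    maximum(13.5, 'hello', 7)   # 'hello' > 13.5 cannot be evaluated
except TypeError as error:
    print('TypeError:', error)
```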


Function maximum’s Definition 


Function maximum specifies three parameters in a comma-separated list. Snippet [2]’s
arguments 12, 27 and 36 are assigned to the parameters value1, value2 and
value3, respectively.

To determine the largest value, we process one value at a time:

• Initially, we assume that value1 contains the largest value, so we assign it to the
local variable max_value. Of course, it’s possible that value2 or value3 contains
the actual largest value, so we still must compare each of these with max_value.

• The first if statement then tests value2 > max_value, and if this condition is
True assigns value2 to max_value.

• The second if statement then tests value3 > max_value, and if this condition is
True assigns value3 to max_value.

Now, max_value contains the largest value, so we return it. When control returns to
the caller, the parameters value1, value2 and value3 and the variable max_value
in the function’s block, which are all local variables, no longer exist.


Python’s Built-In max and min Functions 


For many common tasks, the capabilities you need already exist in Python. For 
example, built-in max and min functions know how to determine the largest and 


smallest of their two or more arguments, respectively: 


In [6]: max('yellow', 'red', 'orange', 'blue', 'green')
Out[6]: 'yellow'

In [7]: min(15, 9, 27, 14)
Out[7]: 9


Each of these functions also can receive an iterable argument, such as a list or a string. 
Using built-in functions or functions from the Python Standard Library’s modules 
rather than writing your own can reduce development time and increase program 
reliability, portability and performance. For a list of Python’s built-in functions and 


modules, see 


https://docs.python.org/3/library/index.html
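As a quick sketch of the iterable form mentioned above, the following passes a list and a string to the built-in functions. The grades list is reused from the previous chapter's examples purely for illustration:

```python
grades = [85, 93, 45, 89, 85]

print(max(grades), min(grades))   # largest and smallest list elements
print(max('yellow'))              # the largest character in the string
print(max(3, 7.5, 2))             # ints and floats may be mixed
```

For a string, the characters are compared individually, so max returns a single character.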


4.4 RANDOM-NUMBER GENERATION 


We now take a brief diversion into a popular type of programming application— 
simulation and game playing. You can introduce the element of chance via the 


Python Standard Library’s random module. 


Rolling a Six-Sided Die 


Let’s produce 10 random integers in the range 1—6 to simulate rolling a six-sided die: 


In [1]: import random

In [2]: for roll in range(10):
   ...:     print(random.randrange(1, 7), end=' ')
   ...:
4 2 5 5 4 6 4 6 1 5


First, we import random so we can use the module’s capabilities. The randrange 


function generates an integer from the first argument value up to, but not including, the 





second argument value. Let’s use the up arrow key to recall the for statement, then 


press Enter to re-execute it. Notice that different values are displayed: 


In [3]: for roll in range(10):
   ...:     print(random.randrange(1, 7), end=' ')
   ...:
4 5 4 5 1 4 1 4 6 5


Sometimes, you may want to guarantee reproducibility of a random sequence—for 
debugging, for example. At the end of this section, we'll use the random module’s seed 


function to do this. 


Rolling a Six-Sided Die 6,000,000 Times 


If randrange truly produces integers at random, every number in its range has an 
equal probability (or chance or likelihood) of being returned each time we call it. To 
show that the die faces 1—6 occur with equal likelihood, the following script simulates 
6,000,000 die rolls. When you run the script, each die face should occur approximately 


1,000,000 times, as in the sample output. 


# fig04_01.py
"""Roll a six-sided die 6,000,000 times."""
import random

# face frequency counters
frequency1 = 0
frequency2 = 0
frequency3 = 0
frequency4 = 0
frequency5 = 0
frequency6 = 0

# 6,000,000 die rolls
for roll in range(6_000_000):  # note underscore separators
    face = random.randrange(1, 7)

    # increment appropriate face counter
    if face == 1:
        frequency1 += 1
    elif face == 2:
        frequency2 += 1
    elif face == 3:
        frequency3 += 1
    elif face == 4:
        frequency4 += 1
    elif face == 5:
        frequency5 += 1
    elif face == 6:
        frequency6 += 1

print(f'Face{"Frequency":>13}')
print(f'{1:>4}{frequency1:>13}')
print(f'{2:>4}{frequency2:>13}')
print(f'{3:>4}{frequency3:>13}')
print(f'{4:>4}{frequency4:>13}')
print(f'{5:>4}{frequency5:>13}')
print(f'{6:>4}{frequency6:>13}')


Face    Frequency
   1       998686
   2      1001481
   3       999900
   4      1000453
   5       999953
   6       999527














The script uses nested control statements (an if...elif statement nested in the for
statement) to determine the number of times each die face appears. The for statement
iterates 6,000,000 times. We used Python’s underscore (_) digit separator to make the
value 6000000 more readable. The expression range(6,000,000) would be
incorrect. Commas separate arguments in function calls, so Python would treat
range(6,000,000) as a call to range with the three arguments 6, 0 and 0, which
fails with a ValueError because the step argument is 0.

For each die roll, the script adds 1 to the appropriate counter variable. Run the
program, and observe the results. This program might take a few seconds to complete
execution. As you'll see, each execution produces different results. Note that we did not
provide an else clause in the if...elif statement.
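As an alternative sketch, and not the approach the script above takes, the six separate counters could be kept in one list indexed by die face. The function roll_frequencies is a hypothetical name of ours, and we use 60,000 rolls so the example finishes quickly:

```python
import random

def roll_frequencies(rolls):
    """Return a list in which indices 1-6 hold the count for each die face."""
    frequencies = [0] * 7   # index 0 is unused so faces map directly to indices
    for _ in range(rolls):
        frequencies[random.randrange(1, 7)] += 1
    return frequencies

frequencies = roll_frequencies(60_000)

print(f'Face{"Frequency":>13}')
for face in range(1, 7):
    print(f'{face:>4}{frequencies[face]:>13}')
```

With this layout, the die face itself selects the counter to increment, so no if...elif statement is needed.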


Seeding the Random-Number Generator for Reproducibility 


Function randrange actually generates pseudorandom numbers, based on an 
internal calculation that begins with a numeric value known as a seed. Repeatedly 
calling randrange produces a sequence of numbers that appear to be random, 
because each time you start a new interactive session or execute a script that uses the 
random module’s functions, Python internally uses a different seed value.* When
you’re debugging logic errors in programs that use randomly generated data, it can be
helpful to use the same sequence of random numbers until you’ve eliminated the logic 
errors, before testing the program with other values. To do this, you can use the 
random module’s seed function to seed the random-number generator yourself 
—this forces randrange to begin calculating its pseudorandom number sequence from 
the seed you specify. In the following session, snippets [5] and [8] produce the same 


results, because snippets [4] and [7] use the same seed (32): 


* According to the documentation, Python bases the seed value on the system clock or 
an operating-system-dependent randomness source. For applications requiring secure 
random numbers, such as cryptography, the documentation recommends using the 


secrets module, rather than the random module. 


In [4]: random.seed(32)

In [5]: for roll in range(10):
   ...:     print(random.randrange(1, 7), end=' ')
   ...:
1 2 2 3 6 2 4 1 6 1
In [6]: for roll in range(10):
   ...:     print(random.randrange(1, 7), end=' ')
   ...:
1 3 5 3 1 5 6 4 3 5
In [7]: random.seed(32)

In [8]: for roll in range(10):
   ...:     print(random.randrange(1, 7), end=' ')
   ...:
1 2 2 3 6 2 4 1 6 1


Snippet [6] generates different values because it simply continues the pseudorandom 


number sequence that began in snippet [5]. 
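The same idea can be sketched in a script by wrapping the seeding and rolling in a function. The name ten_rolls is hypothetical, and the second seed value (33) is an arbitrary choice of ours:

```python
import random

def ten_rolls(seed):
    """Seed the generator, then return ten die rolls as a list."""
    random.seed(seed)
    return [random.randrange(1, 7) for _ in range(10)]

first = ten_rolls(32)
second = ten_rolls(32)   # same seed, so the same sequence
third = ten_rolls(33)    # a different seed usually gives a different sequence

print(first == second)
print(first)
print(third)
```

Because each call reseeds the generator, two calls with the same seed always produce identical lists.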


4.5 CASE STUDY: A GAME OF CHANCE 


In this section, we simulate the popular dice game known as “craps.” Here is the 


requirements statement: 


You roll two six-sided dice, each with faces containing one, two, three, four, five and 
six spots, respectively. When the dice come to rest, the sum of the spots on the two 
upward faces is calculated. If the sum is 7 or 11 on the first roll, you win. If the sum is 
2, 3 or 12 on the first roll (called “craps”), you lose (i.e., the “house” wins). If the sum is 
4, 5, 6, 8, 9 or 10 on the first roll, that sum becomes your “point.” To win, you must 
continue rolling the dice until you “make your point” (i.e., roll that same point value). 


You lose by rolling a 7 before making your point. 


The following script simulates the game and shows several sample executions, 
illustrating winning on the first roll, losing on the first roll, winning on a subsequent 


roll and losing on a subsequent roll. 


 1  # fig04_02.py
 2  """Simulating the dice game Craps."""
 3  import random
 4
 5  def roll_dice():
 6      """Roll two dice and return their face values as a tuple."""
 7      die1 = random.randrange(1, 7)
 8      die2 = random.randrange(1, 7)
 9      return (die1, die2)  # pack die face values into a tuple
10
11  def display_dice(dice):
12      """Display one roll of the two dice."""
13      die1, die2 = dice  # unpack the tuple into variables die1 and die2
14      print(f'Player rolled {die1} + {die2} = {sum(dice)}')
15
16  die_values = roll_dice()  # first roll
17  display_dice(die_values)
18
19  # determine game status and point, based on first roll
20  sum_of_dice = sum(die_values)
21
22  if sum_of_dice in (7, 11):  # win
23      game_status = 'WON'
24  elif sum_of_dice in (2, 3, 12):  # lose
25      game_status = 'LOST'
26  else:  # remember point
27      game_status = 'CONTINUE'
28      my_point = sum_of_dice
29      print('Point is', my_point)
30
31  # continue rolling until player wins or loses
32  while game_status == 'CONTINUE':
33      die_values = roll_dice()
34      display_dice(die_values)
35      sum_of_dice = sum(die_values)
36
37      if sum_of_dice == my_point:  # win by making point
38          game_status = 'WON'
39      elif sum_of_dice == 7:  # lose by rolling 7
40          game_status = 'LOST'
41
42  # display "wins" or "loses" message
43  if game_status == 'WON':
44      print('Player wins')
45  else:
46      print('Player loses')








Player rolled 2 + 5 = 7
Player wins

Player rolled 1 + 2 = 3
Player loses

Player rolled 5 + 4 = 9
Point is 9
Player rolled 4
Player rolled 2
Player rolled 5 + 4 = 9
Player wins

Player rolled 1 + 5 = 6
Point is 6
Player rolled 1 + 6 = 7
Player loses





Function roll_dice—Returning Multiple Values Via a Tuple

Function roll_dice (lines 5–9) simulates rolling two dice on each roll. The function
is defined once, then called from several places in the program (lines 16 and 33). The
empty parameter list indicates that roll_dice does not require arguments to perform
its task.

The built-in and custom functions you’ve called so far each return one value.
Sometimes it’s useful to return more than one value, as in roll_dice, which returns
both die values (line 9) as a tuple—an immutable (that is, unmodifiable) sequence
of values. To create a tuple, separate its values with commas, as in line 9:

(die1, die2)


This is known as packing a tuple. The parentheses are optional, but we recommend 


using them for clarity. We discuss tuples in depth in the next chapter. 


Function display_dice

To use a tuple’s values, you can assign them to a comma-separated list of variables,
which unpacks the tuple. To display each roll of the dice, the function display_dice
(defined in lines 11–14 and called in lines 17 and 34) unpacks the tuple argument it
receives (line 13). The number of variables to the left of = must match the number of
elements in the tuple; otherwise, a ValueError occurs. Line 14 prints a formatted
string containing both die values and their sum. We calculate the sum of the dice by
passing the tuple to the built-in sum function—like a list, a tuple is a sequence.
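Packing, unpacking and the mismatch error can be seen together in a few lines. This is an illustrative sketch; roll_pair is a hypothetical stand-in that returns fixed values so the result is predictable.

```python
def roll_pair():
    """Return two fixed face values packed into a tuple (hypothetical helper)."""
    return (3, 4)

die1, die2 = roll_pair()   # unpack: two variables, two elements
print(f'{die1} + {die2} = {sum((die1, die2))}')   # 3 + 4 = 7

message = None
try:
    a, b, c = roll_pair()  # three variables, but only two elements
except ValueError as error:
    message = str(error)   # unpacking fails with a ValueError

print('ValueError:', message)
```

Catching the exception here is only for demonstration; in a real script a mismatch like this usually signals a logic error that should be fixed rather than handled.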


Note that functions roll_dice and display_dice each begin their blocks with a
docstring that states what the function does. Also, both functions contain local variables
die1 and die2. These variables do not “collide,” because they belong to different
functions’ blocks. Each local variable is accessible only in the block that defined it.


First Roll 


When the script begins executing, lines 16–17 roll the dice and display the results. Line
20 calculates the sum of the dice for use in lines 22–29. You can win or lose on the first
roll or any subsequent roll. The variable game_status keeps track of the win/loss
status.


The in operator in line 22 


sum_of_dice in (7, 11)





tests whether the tuple (7, 11) contains sum_of_dice’s value. If this condition is
True, you rolled a 7 or an 11. In this case, you won on the first roll, so the script sets
game_status to 'WON'. The operator’s right operand can be any iterable. There’s also
a not in operator to determine whether a value is not in an iterable. The preceding
concise condition is equivalent to

(sum_of_dice == 7) or (sum_of_dice == 11)


Similarly, the condition in line 24 


sum_of_dice in (2, 3, 12)





tests whether the tuple (2, 3, 12) contains sum_of_dice’s value. If so, you lost on
the first roll, so the script sets game_status to 'LOST'.
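The membership tests on the first roll can be tried in isolation. The sketch below wraps them in a hypothetical helper, first_roll_status, which is not part of the script; it only mirrors the logic of the win and lose conditions.

```python
def first_roll_status(sum_of_dice):
    """Classify a first roll (hypothetical helper mirroring the script's tests)."""
    if sum_of_dice in (7, 11):
        return 'WON'
    elif sum_of_dice in (2, 3, 12):
        return 'LOST'
    return 'CONTINUE'

# The membership test agrees with the chained == comparisons for every sum.
for total in range(2, 13):
    assert (total in (7, 11)) == ((total == 7) or (total == 11))

print(first_roll_status(11))   # WON
print(first_roll_status(3))    # LOST
print(first_roll_status(8))    # CONTINUE
print(8 not in (2, 3, 12))     # True
```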


For any other sum of the dice (4, 5, 6, 8, 9 or 10): 


• line 27 sets game_status to 'CONTINUE' so you can continue rolling,

• line 28 stores the sum of the dice in my_point to keep track of what you must roll
to win and

• line 29 displays my_point.


Subsequent Rolls 


If game_status is equal to 'CONTINUE' (line 32), you did not win or lose, so the
while statement’s suite (lines 33–40) executes. Each loop iteration calls roll_dice,
displays the die values and calculates their sum. If sum_of_dice is equal to my_point
(line 37) or 7 (line 39), the script sets game_status to 'WON' or 'LOST',
respectively, and the loop terminates. Otherwise, the while loop continues executing
with the next roll.


Displaying the Final Results 





When the loop terminates, the script proceeds to the if…else statement (lines 43–46),
which prints 'Player wins' if game_status is 'WON', or 'Player loses'
otherwise.
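To exercise the same rules many times, the game logic can be moved into a function that reports the outcome instead of printing it. This is a sketch under stated assumptions: play_craps is a hypothetical name, the seed and game count are arbitrary, and the restructuring into early returns is one possible design, not the script's own.

```python
import random

def roll_dice():
    """Roll two dice and return their face values as a tuple."""
    return (random.randrange(1, 7), random.randrange(1, 7))

def play_craps():
    """Play one game; return True if the player wins (hypothetical helper)."""
    sum_of_dice = sum(roll_dice())
    if sum_of_dice in (7, 11):        # win on first roll
        return True
    if sum_of_dice in (2, 3, 12):     # lose on first roll
        return False
    my_point = sum_of_dice            # remember point
    while True:
        sum_of_dice = sum(roll_dice())
        if sum_of_dice == my_point:   # win by making point
            return True
        if sum_of_dice == 7:          # lose by rolling 7
            return False

random.seed(32)   # arbitrary seed so repeated runs give the same tally
games = 10_000
wins = 0
for _ in range(games):
    if play_craps():
        wins += 1

print(f'Player won {wins} of {games} games')
```

Returning a value rather than printing makes the function easy to call in a loop; the game_status variable from the script is no longer needed because each outcome returns immediately.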


4.6 PYTHON STANDARD LIBRARY 


Typically, you write Python programs by combining functions and classes (that is, 
custom types) that you create with preexisting functions and classes defined in 
modules, such as those in the Python Standard Library and other libraries. A key 


programming goal is to avoid “reinventing the wheel.” 


A module is a file that groups related functions, data and classes. The type Decimal
from the Python Standard Library’s decimal module is actually a class. We introduced
classes briefly in Chapter 1 and discuss them in detail in the “Object-Oriented
Programming” chapter. A package groups related modules. In this book, you'll work
with many preexisting modules and packages, and you'll create your own modules—in
fact, every Python source-code (.py) file you create is a module. Creating packages is
beyond this book’s scope. They’re typically used to organize a large library’s
functionality into smaller subsets that are easier to maintain and can be imported
separately for convenience. For example, the matplotlib visualization library that we
use in Section 5.17 has extensive functionality (its documentation is over 2300 pages),
so we'll import only the subsets we need in our examples (pyplot and animation).


The Python Standard Library is provided with the core Python language. Its packages
and modules contain capabilities for a wide variety of everyday programming tasks.*
You can see a complete list of the standard library modules at

https://docs.python.org/3/library/

* The Python Tutorial refers to this as the “batteries included” approach.


You’ve already used capabilities from the decimal, statistics and random 
modules. In the next section, you'll use mathematics capabilities from the math 
module. You'll see many other Python Standard Library modules throughout the book’s 


examples, including many of those in the following table: 


Some popular Python Standard Library modules

collections—Data structures beyond lists, tuples, dictionaries and sets.
Cryptography modules—Encrypting data for secure transmission.
csv—Processing comma-separated value files (like those in Excel).
datetime—Date and time manipulations. Also modules time and calendar.
decimal—Fixed-point and floating-point arithmetic, including monetary calculations.
doctest—Embed validation tests and expected results in docstrings for simple unit testing.
gettext and locale—Internationalization and localization modules.
json—JavaScript Object Notation (JSON) processing used with web services and NoSQL document databases.
math—Common math constants and operations.
os—Interacting with the operating system.
profile, pstats, timeit—Performance analysis.
random—Pseudorandom numbers.
re—Regular expressions for pattern matching.
sqlite3—SQLite relational database access.
statistics—Mathematical statistics functions such as mean, median, mode and variance.
string—String processing.
sys—Command-line argument processing; standard input, standard output and standard error streams.
tkinter—Graphical user interfaces (GUIs) and canvas-based graphics.
turtle—Turtle graphics.
webbrowser—For conveniently displaying web pages in Python apps.


4.7 MATH MODULE FUNCTIONS 


The math module defines functions for performing various common mathematical 
calculations. Recall from the previous chapter that an import statement of the 


following form enables you to use a module’s definitions via the module’s name and a 
dot (.): 
In [1]: import math


For example, the following snippet calculates the square root of 900 by calling the math 





module’s sqrt function, which returns its result as a float value: 


In [2]: math.sqrt(900)
Out[2]: 30.0


Similarly, the following snippet calculates the absolute value of -10 by calling the math 





module’s fabs function, which returns its result as a float value: 


In [3]: math.fabs(-10)
Out[3]: 10.0


Some math module functions are summarized below—you can view the complete list at 


https://docs.python.org/3/library/math.html


Function     Description                                          Example

ceil(x)      Rounds x to the smallest integer not less than x     ceil(9.2) is 10; ceil(-9.8) is -9
             (returns an int in Python 3)
floor(x)     Rounds x to the largest integer not greater than x   floor(9.2) is 9; floor(-9.8) is -10
             (returns an int in Python 3)
sin(x)       Trigonometric sine of x (x in radians)               sin(0.0) is 0.0
cos(x)       Trigonometric cosine of x (x in radians)             cos(0.0) is 1.0
tan(x)       Trigonometric tangent of x (x in radians)            tan(0.0) is 0.0
exp(x)       Exponential function e^x                             exp(1.0) is 2.718282; exp(2.0) is 7.389056
log(x)       Natural logarithm of x (base e)                      log(2.718282) is 1.0; log(7.389056) is 2.0
log10(x)     Logarithm of x (base 10)                             log10(10.0) is 1.0; log10(100.0) is 2.0
pow(x, y)    x raised to power y (x^y)                            pow(2.0, 7.0) is 128.0; pow(9.0, .5) is 3.0
sqrt(x)      Square root of x                                     sqrt(900.0) is 30.0; sqrt(9.0) is 3.0
fabs(x)      Absolute value of x; always returns a float.         fabs(5.1) is 5.1; fabs(-5.1) is 5.1
             Python also has the built-in function abs, which
             returns an int or a float, based on its argument.
fmod(x, y)   Remainder of x/y as a floating-point number          fmod(9.8, 4.0) is 1.8

The floating-point examples are shown rounded.
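A few of these functions can be checked directly in a script. The sketch below is illustrative only; it highlights that in Python 3 ceil and floor return int values, while fabs always returns a float, and it uses math.isclose where floating-point rounding makes exact comparison unreliable.

```python
import math

ceil_result = math.ceil(9.2)      # 10, an int in Python 3
floor_result = math.floor(-9.8)   # -10, an int in Python 3
print(ceil_result, floor_result)

print(math.fabs(-10), abs(-10))   # 10.0 10
print(math.pow(2.0, 7.0))         # 128.0
print(math.sqrt(900))             # 30.0

# Results such as fmod(9.8, 4.0) carry rounding error, so compare with isclose.
remainder = math.fmod(9.8, 4.0)
print(math.isclose(remainder, 1.8))   # True
```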


4.8 USING IPYTHON TAB COMPLETION FOR DISCOVERY 


You can view a module’s documentation in IPython interactive mode via tab 
completion—a discovery feature that speeds your coding and discovery processes. 
After you type a portion of an identifier and press Tab, IPython completes the identifier 
for you or provides a list of identifiers that begin with what you’ve typed so far. This 
may vary based on your operating system platform and what you have imported into 


your IPython session: 


In [1]: import math

In [2]: ma<Tab>
           map        %macro     %%markdown
           math       %magic     %matplotlib
           max()      %man


You can scroll through the identifiers with the up and down arrow keys. As you do, 
IPython highlights an identifier and shows it to the right of the In [] prompt. 


Viewing Identifiers in a Module 


To view a list of identifiers defined in a module, type the module’s name and a dot (.), 


then press Tab: 


In [3]: math.<Tab>
             acos()    atan()    copysign()  e        expm1()
             acosh()   atan2()   cos()       erf()    fabs()
             asin()    atanh()   cosh()      erfc()   factorial() >
             asinh()   ceil()    degrees()   exp()    floor()








If there are more identifiers to display than are currently shown, IPython displays the > 





symbol (on some platforms) at the right edge, in this case to the right of factorial (). 
You can use the up and down arrow keys to scroll through the list. In the list of 


identifiers: 


• Those followed by parentheses are functions (or methods, as you'll see later).

• Single-word identifiers (such as Employee) that begin with an uppercase letter and
multiword identifiers in which each word begins with an uppercase letter (such as
CommissionEmployee) represent class names (there are none in the preceding
list). This naming convention, which the Style Guide for Python Code recommends,
is known as CamelCase because the uppercase letters stand out like a camel’s
humps.

• Lowercase identifiers without parentheses, such as pi (not shown in the preceding
list) and e, are variables. The identifier pi evaluates to 3.141592653589793, and
the identifier e evaluates to 2.718281828459045. In the math module, pi and e
represent the mathematical constants π and e, respectively.


Python does not have constants, although many objects in Python are immutable 
(nonmodifiable). So even though pi and e are real-world constants, you must not 
assign new values to them, because that would change their values. To help distinguish 
constants from other variables, the style guide recommends naming your custom 


constants with all capital letters. 


Using the Currently Highlighted Function 


As you navigate through the identifiers, if you wish to use a currently highlighted 
function, simply start typing its arguments in parentheses. IPython then hides the 
autocompletion list. If you need more information about the currently highlighted item, 


you can view its docstring by typing a question mark (?) following the name and 





pressing Enter to view the help documentation. The following shows the fabs 


function’s docstring: 


In [4]: math.fabs?
Docstring:
fabs(x)

Return the absolute value of the float x.
Type:      builtin_function_or_method





The builtin_function_or_method shown above indicates that fabs is part of a
Python Standard Library module. Such modules are considered to be built into Python.
In this case, fabs is a built-in function from the math module.





4.9 DEFAULT PARAMETER VALUES 


When defining a function, you can specify that a parameter has a default parameter 
value. When calling the function, if you omit the argument for a parameter with a 
default parameter value, the default value for that parameter is automatically passed. 


Let’s define a function rectangle_area with default parameter values:

In [1]: def rectangle_area(length=2, width=3):
   ...:     """Return a rectangle's area."""
   ...:     return length * width
   ...:


You specify a default parameter value by following a parameter’s name with an = and a 
value—in this case, the default parameter values are 2 and 3 for length and width, 
respectively. Any parameters with default parameter values must appear in the 


parameter list to the right of parameters that do not have defaults. 


The following call to rectangle_area has no arguments, so IPython uses both
default parameter values as if you had called rectangle_area(2, 3):

In [2]: rectangle_area()
Out[2]: 6


The following call to rectangle_area has only one argument. Arguments are
assigned to parameters from left to right, so 10 is used as the length. The interpreter
passes the default parameter value 3 for the width as if you had called
rectangle_area(10, 3):

In [3]: rectangle_area(10)
Out[3]: 30


The following call to rectangle_area has arguments for both length and width, so
IPython ignores the default parameter values:

In [4]: rectangle_area(10, 5)
Out[4]: 50
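The three calls can be collected into one script. This sketch repeats the rectangle_area definition so it runs on its own; the last call, which names only width, is an added illustration rather than one of the session's snippets.

```python
def rectangle_area(length=2, width=3):
    """Return a rectangle's area."""
    return length * width

print(rectangle_area())          # 6: both defaults used
print(rectangle_area(10))        # 30: width defaults to 3
print(rectangle_area(10, 5))     # 50: no defaults used
print(rectangle_area(width=5))   # 10: length defaults to 2

# A parameter without a default may not follow one with a default:
# def bad(length=2, width): ...   # SyntaxError
```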


4.10 KEYWORD ARGUMENTS 


When calling functions, you can use keyword arguments to pass arguments in any 
order. To demonstrate keyword arguments, we redefine the rectangle_area
function—this time without default parameter values:

In [1]: def rectangle_area(length, width):
   ...:     """Return a rectangle's area."""
   ...:     return length * width
   ...:


Each keyword argument in a call has the form parametername=value. The following 
call shows that the order of keyword arguments does not matter—they do not need to 


match the corresponding parameters’ positions in the function definition: 


In [2]: rectangle_area(width=5, length=10)
Out[2]: 50


In each function call, you must place keyword arguments after a function’s positional 
arguments—that is, any arguments for which you do not specify the parameter name. 
Such arguments are assigned to the function’s parameters left-to-right, based on the 
arguments’ positions in the argument list. Keyword arguments are also helpful for 
improving the readability of function calls, especially for functions with many 


arguments. 
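Mixing positional and keyword arguments can be sketched as follows. The example redefines rectangle_area so it is self-contained, and the error case (supplying length twice) is an added illustration, not one of the snippets above.

```python
def rectangle_area(length, width):
    """Return a rectangle's area."""
    return length * width

print(rectangle_area(10, width=5))          # 50: positional first, then keyword
print(rectangle_area(width=5, length=10))   # 50: keyword order does not matter

# rectangle_area(length=10, 5) is a SyntaxError: positional after keyword.

duplicate_error = None
try:
    rectangle_area(10, length=5)   # length supplied twice
except TypeError as error:
    duplicate_error = str(error)

print('TypeError:', duplicate_error)
```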


4.11 ARBITRARY ARGUMENT LISTS 


Functions with arbitrary argument lists, such as built-in functions min and max, 


can receive any number of arguments. Consider the following min call: 


min(88, 75, 96, 55, 83)


The function’s documentation states that min has two required parameters (named 
arg1 and arg2) and an optional third parameter of the form *args, indicating that 
the function can receive any number of additional arguments. The * before the 
parameter name tells Python to pack any remaining arguments into a tuple that’s 


passed to the args parameter. In the call above, parameter arg1 receives 88, 


parameter arg2 receives 75 and parameter args receives the tuple (96, 55, 83). 


Defining a Function with an Arbitrary Argument List 


Let’s define an average function that can receive any number of arguments: 


In [1]: def average(*args):
   ...:     return sum(args) / len(args)
   ...:

The parameter name args is used by convention, but you may use any identifier. If the
function has multiple parameters, the *args parameter must be the rightmost
parameter.

Now, let’s call average several times with arbitrary argument lists of different lengths:


In [2]: average(5, 10)
Out[2]: 7.5

In [3]: average(5, 10, 15)
Out[3]: 10.0

In [4]: average(5, 10, 15, 20)
Out[4]: 12.5


To calculate the average, divide the sum of the args tuple’s elements (returned by 
built-in function sum) by the tuple’s number of elements (returned by built-in function 
len). Note in our average definition that if the length of args is 0, a 
ZeroDivisionError occurs. In the next chapter, you'll see how to access a tuple’s 


elements without unpacking them. 
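One way to avoid the ZeroDivisionError is to reject an empty argument list explicitly. This is a sketch of one possible design, not the definition used above; raising ValueError and the wording of the message are assumptions made for illustration.

```python
def average(*args):
    """Return the mean of the arguments; require at least one."""
    if not args:   # an empty tuple is falsy
        raise ValueError('average requires at least one argument')
    return sum(args) / len(args)

print(average(5, 10))       # 7.5
print(average(5, 10, 15))   # 10.0

empty_error = None
try:
    average()
except ValueError as error:
    empty_error = str(error)

print(empty_error)   # average requires at least one argument
```

An alternative design would declare one required parameter before *args, so that calling the function with no arguments fails at the call itself.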


Passing an Iterable’s Individual Elements as Function Arguments 


You can unpack a tuple’s, list’s or other iterable’s elements to pass them as individual 
function arguments. The * operator, when applied to an iterable argument in a 
function call, unpacks its elements. The following code creates a five-element grades 
list, then uses the expression *grades to unpack its elements as average’s 


arguments: 


In [5]: grades = [88, 75, 96, 55, 83]

In [6]: average(*grades)
Out[6]: 79.4

The call shown above is equivalent to average(88, 75, 96, 55, 83).
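The * operator works the same way with other iterables. The sketch below repeats the average definition so it runs on its own; the tuple and range arguments are added illustrations.

```python
def average(*args):
    return sum(args) / len(args)

grades = [88, 75, 96, 55, 83]

print(average(*grades))         # 79.4: list elements become five arguments
print(average(*(5, 10, 15)))    # 10.0: a tuple unpacks the same way
print(average(*range(1, 5)))    # 2.5: so does a range (1, 2, 3, 4)

# Unpacked elements can be combined with ordinary positional arguments.
with_extra = average(100, *grades)
print(with_extra)
```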


4.12 METHODS: FUNCTIONS THAT BELONG TO 
OBJECTS 


A method is simply a function that you call on an object using the form 


object_name.method_name(arguments)


For example, the following session creates the string variable s and assigns it the string
object 'Hello'. Then the session calls the object’s lower and upper methods, which
produce new strings containing all-lowercase and all-uppercase versions of the original
string, leaving s unchanged:

In [1]: s = 'Hello'

In [2]: s.lower()  # call lower method on string object s
Out[2]: 'hello'

In [3]: s.upper()
Out[3]: 'HELLO'

In [4]: s
Out[4]: 'Hello'


The Python Standard Library reference at 


https://docs.python.org/3/library/index.html


describes the methods of built-in types and the types in the Python Standard Library. 
In the “Object-Oriented Programming” chapter, you'll create custom types called 


classes and define custom methods that you can call on objects of those classes. 


4.13 SCOPE RULES 


Each identifier has a scope that determines where you can use it in your program. For 


that portion of the program, the identifier is said to be “in scope.” 


Local Scope 


A local variable’s identifier has local scope. It’s “in scope” only from its definition to 
the end of the function’s block. It “goes out of scope” when the function returns to its 


caller. So, a local variable can be used only inside the function that defines it. 


Global Scope 


Identifiers defined outside any function (or class) have global scope—these may
include functions, variables and classes. Variables with global scope are known as
global variables. Identifiers with global scope can be used in a .py file or interactive
session anywhere after they’re defined.


Accessing a Global Variable from a Function 


You can access a global variable’s value inside a function: 


In [1]: x = 7

In [2]: def access_global():
   ...:     print('x printed from access_global:', x)
   ...:

In [3]: access_global()
x printed from access_global: 7


However, by default, you cannot modify a global variable in a function—when you first 


assign a value to a variable in a function’s block, Python creates a new local variable: 


In [4]: def try_to_modify_global():
   ...:     x = 3.5
   ...:     print('x printed from try_to_modify_global:', x)
   ...:

In [5]: try_to_modify_global()
x printed from try_to_modify_global: 3.5

In [6]: x
Out[6]: 7

In function try_to_modify_global’s block, the local x shadows the global x,
making it inaccessible in the scope of the function’s block. Snippet [6] shows that
global variable x still exists and has its original value (7) after function
try_to_modify_global executes.


To modify a global variable in a function’s block, you must use a global statement to 


declare that the variable is defined in the global scope: 


In [7]: def modify_global():
   ...:     global x
   ...:     x = 'hello'
   ...:     print('x printed from modify_global:', x)
   ...:

In [8]: modify_global()
x printed from modify_global: hello

In [9]: x
Out[9]: 'hello'
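The three behaviors (reading a global, shadowing it, and rebinding it with a global statement) can be compared side by side in a script. This sketch assumes it runs as an ordinary module-level script; the functions return values instead of printing so the results can be collected in variables.

```python
x = 7

def try_to_modify_global():
    x = 3.5            # creates a new local x that shadows the global
    return x

def modify_global():
    global x           # declare that x refers to the global variable
    x = 'hello'
    return x

local_result = try_to_modify_global()
after_local = x        # still 7: the global was untouched
global_result = modify_global()
after_global = x       # now 'hello': the global was rebound

print(local_result, after_local, global_result, after_global)
```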


Blocks vs. Suites 


You’ve now defined function blocks and control statement suites. When you create a 
variable in a block, it’s local to that block. However, when you create a variable in a 
control statement’s suite, the variable’s scope depends on where the control statement 


is defined: 


• If the control statement is in the global scope, then any variables defined in the
control statement have global scope.

• If the control statement is in a function’s block, then any variables defined in the
control statement have local scope.


We'll continue our scope discussion in the “Object-Oriented Programming” chapter 


when we introduce custom classes. 
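The difference between a suite at global scope and a suite inside a function can be sketched as follows. The names last_square, squares_last and running are hypothetical and chosen only for this illustration.

```python
for number in range(3):
    last_square = number ** 2   # defined in a suite at global scope

print(last_square)              # 4: still accessible after the loop ends

def squares_last():
    for value in range(3):
        running = value ** 2    # defined in a suite inside a function block
    return running              # accessible here, but local to squares_last

print(squares_last())           # 4
```

In both cases the variable outlives the loop's suite; what differs is whether it belongs to the global scope or to the enclosing function's local scope.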


Shadowing Functions 


In the preceding chapters, when summing values, we stored the sum in a variable 
named total. The reason we did this is that sum is a built-in function. If you define a 
variable named sum, it shadows the built-in function, making it inaccessible in your 
code. When you execute the following assignment, Python binds the identifier sum to 
the int object containing 15. At this point, the identifier sum no longer references the 


built-in function. So, when you try to use sum as a function, a TypeError occurs: 


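In [10]: sum = 10 + 5

In [11]: sum
Out[11]: 15

In [12]: sum([10, 5])
-------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-1237d97a65fb> in <module>()
----> 1 sum([10, 5])

TypeError: 'int' object is not callable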
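The same shadowing effect can be reproduced safely inside a function, where the shadowing name is local and disappears when the function returns. This is an illustrative sketch; shadow_demo and safe_total are hypothetical names.

```python
def shadow_demo():
    """Shadow the built-in sum locally and report the resulting error."""
    sum = 10 + 5               # local name sum now refers to the int 15
    try:
        sum([10, 5])           # tries to call an int
    except TypeError as error:
        return str(error)

def safe_total():
    """Use a different name so the built-in sum stays available."""
    total = sum([10, 5])
    return total

print(shadow_demo())   # 'int' object is not callable
print(safe_total())    # 15
```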











Statements at Global Scope 


In the scripts you’ve seen so far, we’ve written some statements outside functions at the 
global scope and some statements inside function blocks. Script statements at global 
scope execute as soon as they’re encountered by the interpreter, whereas statements in 


a block execute only when the function is called. 


4.14 IMPORT: A DEEPER LOOK 


You’ve imported modules (such as math and random) with a statement like:

import module_name

then accessed their features via each module’s name and a dot (.). Also, you’ve
imported a specific identifier from a module (such as the decimal module’s Decimal
type) with a statement like:

from module_name import identifier

then used that identifier without having to precede it with the module name and a dot
(.).


Importing Multiple Identifiers from a Module 


Using the from…import statement, you can import a comma-separated list of identifiers
from a module, then use them in your code without having to precede them with the
module name and a dot (.):

In [1]: from math import ceil, floor

In [2]: ceil(10.3)
Out[2]: 11

In [3]: floor(10.7)
Out[3]: 10


Trying to use a function that’s not imported causes a NameError, indicating that the 


name is not defined. 


Caution: Avoid Wildcard Imports 


You can import all identifiers defined in a module with a wildcard import of the form 
from modulename import * 
This makes all of the module’s identifiers available for use in your code. Importing a 


module’s identifiers with a wildcard import can lead to subtle errors—it’s considered a 


dangerous practice that you should avoid. Consider the following snippets: 


In [4]: e = 'hello'

In [5]: from math import *

In [6]: e
Out[6]: 2.718281828459045





Initially, we assign the string 'hello' to a variable named e. After executing snippet 
[5] though, the variable e is replaced, possibly by accident, with the math module’s 


constant e, representing the mathematical floating-point value e. 


Binding Names for Modules and Module Identifiers 


Sometimes it’s helpful to import a module and use an abbreviation for it to simplify
your code. The import statement’s as clause allows you to specify the name used to
reference the module’s identifiers. For example, in Section 3.14 we could have imported
the statistics module and accessed its mean function as follows:

In [7]: import statistics as stats

In [8]: grades = [85, 93, 45, 87, 93]

In [9]: stats.mean(grades)
Out[9]: 80.6





As you'll see in later chapters, import as is frequently used to import Python libraries 





with convenient abbreviations, like stats for the statistics module. As another 


example, we'll use the numpy module which typically is imported with 


import numpy as np 


Library documentation often mentions popular shorthand names. 


Typically, when importing a module, you should use import or import as statements, 





then access the module through the module name or the abbreviation following the as 
keyword, respectively. This ensures that you do not accidentally import an identifier 


that conflicts with one in your code. 
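The import forms discussed in this section can be compared in one script. This is an illustrative sketch; the variable names and sample values are chosen here and are not taken from the sessions above.

```python
import math                       # access names as math.name
import statistics as stats        # access names through an abbreviation
from math import ceil, floor      # import selected identifiers directly

root = math.sqrt(900)
mean_value = stats.mean([5, 10, 15])
print(root, mean_value)           # 30.0 10
print(ceil(10.3), floor(10.7))    # 11 10

# Qualified access keeps module names from colliding with your own.
e = 'hello'
print(e)                          # hello
print(math.e)                     # the math constant is still reachable
```

Because e is assigned in the script and math's constant is reached as math.e, neither replaces the other, which is the conflict a wildcard import can cause.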


4.15 PASSING ARGUMENTS TO FUNCTIONS: A DEEPER 
LOOK 


Let’s take a closer look at how arguments are passed to functions. In many 
programming languages, there are two ways to pass arguments—pass-by-value and 
pass-by-reference (sometimes called call-by-value and call-by-reference, 


respectively): 


• With pass-by-value, the called function receives a copy of the argument’s value and
works exclusively with that copy. Changes to the function’s copy do not affect the
original variable’s value in the caller.

• With pass-by-reference, the called function can access the argument’s value in the
caller directly and modify the value if it’s mutable.


Python arguments are always passed by reference. Some people call this
pass-by-object-reference, because “everything in Python is an object.”³ When a function call
provides an argument, Python copies the argument object’s reference—not the object
itself—into the corresponding parameter. This is important for performance. Functions
often manipulate large objects—frequently copying them would consume large amounts
of computer memory and significantly slow program performance.

³ Even the functions you defined in this chapter and the classes (custom types) you’ll
define in later chapters are objects in Python.


Memory Addresses, References and “Pointers” 


You interact with an object via a reference, which behind the scenes is that object’s
address (or location) in the computer’s memory—sometimes called a “pointer” in other
languages. After an assignment like

x = 7

the variable x does not actually contain the value 7. Rather, it contains a reference to an
object containing 7 stored elsewhere in memory. You might say that x “points to” (that
is, references) the object containing 7, as in the diagram below:

[Diagram: the variable x refers to an object containing the value 7.]


Built-In Function id and Object Identities 


Let’s consider how we pass arguments to functions. First, let’s create the integer
variable x mentioned above—shortly we'll use x as a function argument:

In [1]: x = 7

Now x refers to (or “points to”) the integer object containing 7. No two separate objects
can reside at the same address in memory, so every object in memory has a unique
address. Though we can’t see an object’s address, we can use the built-in id function
to obtain a unique int value which identifies only that object while it remains in
memory (you'll likely get a different value when you run this on your computer):

In [2]: id(x)
Out[2]: 4350477840


The integer result of calling id is known as the object’s identity.⁴ No two objects in
memory can have the same identity. We'll use object identities to demonstrate that
objects are passed by reference.

⁴ According to the Python documentation, depending on the Python implementation
you’re using, an object’s identity may be the object’s actual memory address, but this is
not required.


Passing an Object to a Function 


Let’s define a cube function that displays its parameter’s identity, then returns the 


parameter’s value cubed: 


In [3]: def cube(number):
   ...:     print('id(number):', id(number))
   ...:     return number ** 3
   ...:

Next, let’s call cube with the argument x, which refers to the integer object containing
7:

In [4]: cube(x)
id(number): 4350477840
Out[4]: 343


The identity displayed for cube’s parameter number—4350477840—is the same as 
that displayed for x previously. Since every object has a unique identity, both the 
argument x and the parameter number refer to the same object while cube executes. 
So when function cube uses its parameter number in its calculation, it gets the value of 


number from the original object in the caller. 


Testing Object Identities with the is Operator

You also can prove that the argument and the parameter refer to the same object with
Python’s is operator, which returns True if its two operands have the same identity:

In [5]: def cube(number):
   ...:     print('number is x:', number is x)  # x is a global variable
   ...:     return number ** 3
   ...:

In [6]: cube(x)
number is x: True
Out[6]: 343








Immutable Objects as Arguments 


When a function receives as an argument a reference to an immutable (unmodifiable) 


object—such as an int, float, string or tuple—even though you have direct access 





to the original object in the caller, you cannot modify the original immutable object’s 
value. To prove this, first let’s have cube display id(number) before and after


assigning a new object to the parameter number via an augmented assignment: 


In [7]: def cube(number):
   ...:     print('id(number) before modifying number:', id(number))
   ...:     number **= 3
   ...:     print('id(number) after modifying number:', id(number))
   ...:     return number
   ...:

In [8]: cube(x)
id(number) before modifying number: 4350477840
id(number) after modifying number: 4396653744
Out[8]: 343


When we call cube(x), the first print statement shows that id(number) initially is
the same as id(x) in snippet [2]. Numeric values are immutable, so the statement
number **= 3


actually creates a new object containing the cubed value, then assigns that object’s 


reference to parameter number. Recall that if there are no more references to the 


original object, it will be garbage collected. Function cube’s second print statement 
shows the new object’s identity. Object identities must be unique, so number must 
refer to a different object. To show that x was not modified, we display its value and 


identity again: 


In [9]: print(f'x = {x}; id(x) = {id(x)}')
x = 7; id(x) = 4350477840


Mutable Objects as Arguments 


In the next chapter, we'll show that when a reference to a mutable object like a list is 


passed to a function, the function can modify the original object in the caller. 
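As a brief preview of that behavior, here is a minimal sketch. The function name replace_first and the list numbers are illustrative assumptions, not names from this book. Because the parameter and the argument refer to the same list object, an element assignment inside the function is visible to the caller:

```python
def replace_first(items):
    """Hypothetical helper: overwrite the first element of the caller's list."""
    print('id(items):  ', id(items))
    items[0] = 100  # modifies the list object the caller passed in


numbers = [1, 2, 3]
print('id(numbers):', id(numbers))  # same identity as items inside the call
replace_first(numbers)
print(numbers)  # [100, 2, 3]
```

The list’s identity is the same before and after the call; only its contents change.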


4.16 RECURSION 


Let’s write a program to perform a famous mathematical calculation. Consider the 
factorial of a positive integer n, which is written n! and pronounced “n factorial.” This 


is the product 


n · (n - 1) · (n - 2) · … · 1


with 1! equal to 1 and 0! defined to be 1. For example, 5! is the product 5 · 4 · 3 · 2 · 1,


which is equal to 120. 


Iterative Factorial Approach 


You can calculate 5! iteratively with a for statement, as in: 


In [1]: factorial = 1

In [2]: for number in range(5, 0, -1):
   ...:     factorial *= number
   ...:

In [3]: factorial
Out[3]: 120


Recursive Problem Solving 


Recursive problem-solving approaches have several elements in common. When you 
call a recursive function to solve a problem, it’s actually capable of solving only the 
simplest case(s), or base case(s). If you call the function with a base case, it 
immediately returns a result. If you call the function with a more complex problem, it 
typically divides the problem into two pieces—one that the function knows how to do 
and one that it does not know how to do. To make recursion feasible, this latter piece 
must be a slightly simpler or smaller version of the original problem. Because this new 
problem resembles the original problem, the function calls a fresh copy of itself to work 
on the smaller problem—this is referred to as a recursive call and is also called the 
recursion step. This concept of separating the problem into two smaller portions is a 


form of the divide-and-conquer approach introduced earlier in the book. 


The recursion step executes while the original function call is still active (i.e., it has not 
finished executing). It can result in many more recursive calls as the function divides 
each new subproblem into two conceptual pieces. For the recursion to eventually 
terminate, each time the function calls itself with a simpler version of the original 
problem, the sequence of smaller and smaller problems must converge on a base case. 
When the function recognizes the base case, it returns a result to the previous copy of 
the function. A sequence of returns ensues until the original function call returns the 


final result to the caller. 


Recursive Factorial Approach 


You can arrive at a recursive factorial representation by observing that n! can be written 


as:


n! = n · (n - 1)!


For example, 5! is equal to 5 · 4!, as in:


5! = 5 · 4 · 3 · 2 · 1
5! = 5 · (4 · 3 · 2 · 1)
5! = 5 · (4!)


Visualizing Recursion 


The evaluation of 5! would proceed as shown below. The left column shows how the 
succession of recursive calls proceeds until 1! (the base case) is evaluated to be 1, which 
terminates the recursion. The right column shows from bottom to top the values 


returned from each recursive call to its caller until the final value is calculated and
returned.


[Figure: (a) Sequence of recursive calls; (b) Values returned from each recursive call]

(a) 5!  ->  5 * 4!  ->  4 * 3!  ->  3 * 2!  ->  2 * 1!  ->  1

(b) 1 returned
    2! = 2 * 1 = 2 is returned
    3! = 3 * 2 = 6 is returned
    4! = 4 * 6 = 24 is returned
    5! = 5 * 24 = 120 is returned
    Final value = 120


Implementing a Recursive Factorial Function 


The following session uses recursion to calculate and display the factorials of the 


integers 0 through 10:


In [4]: def factorial(number):
   ...:     """Return factorial of number."""
   ...:     if number <= 1:
   ...:         return 1
   ...:     return number * factorial(number - 1)  # recursive call
   ...:

In [5]: for i in range(11):
   ...:     print(f'{i}! = {factorial(i)}')
   ...:
0! = 1
1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
7! = 5040
8! = 40320
9! = 362880
10! = 3628800





Snippet [4]’s recursive function factorial first determines whether the terminating
condition number <= 1 is True. If this condition is True (the base case), factorial
returns 1 and no further recursion is necessary. If number is greater than 1, the second
return statement expresses the problem as the product of number and a recursive
call to factorial that evaluates factorial(number - 1). This is a slightly
smaller problem than the original calculation, factorial(number). Note that
function factorial must receive a nonnegative argument. We do not test for this
case.


The loop in snippet [5] calls the factorial function for the values from 0 through
10. The output shows that factorial values grow quickly. Python does not limit the size
of an integer, unlike many other programming languages.
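Since the session’s factorial does not test for a negative argument, one possible variation is sketched below. The name checked_factorial and the choice to raise a ValueError are our assumptions for illustration, not part of the original example. The last call also hints at how quickly the values grow:

```python
def checked_factorial(number):
    """Return number! recursively; reject arguments that are not nonnegative ints."""
    if not isinstance(number, int) or number < 0:
        raise ValueError('number must be a nonnegative integer')

    if number <= 1:  # base case
        return 1

    return number * checked_factorial(number - 1)  # recursion step


print(checked_factorial(5))   # 120
print(checked_factorial(25))  # a 26-digit value; Python integers do not overflow

try:
    checked_factorial(-3)
except ValueError as error:
    print('ValueError:', error)
```

Without the check, a negative argument would satisfy number <= 1 immediately and the function would quietly return 1.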


Indirect Recursion 


A recursive function may call another function, which may, in turn, make a call back to 
the recursive function. This is known as an indirect recursive call or indirect 
recursion. For example, function A calls function B, which makes a call back to 
function A. This is still recursion because the second call to function A is made while the 
first call to function A is active. That is, the first call to function A has not yet finished 
executing (because it is waiting on function B to return a result to it) and has not 


returned to function A’s original caller. 
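A small sketch of indirect recursion follows. The functions is_even and is_odd are illustrative names that play the roles of function A and function B; each calls the other until the argument reaches the base case of 0. The sketch assumes a nonnegative integer argument:

```python
def is_even(n):
    """Return True if the nonnegative integer n is even."""
    if n == 0:  # base case
        return True
    return is_odd(n - 1)  # is_odd will call back to is_even


def is_odd(n):
    """Return True if the nonnegative integer n is odd."""
    if n == 0:  # base case
        return False
    return is_even(n - 1)  # call back to is_even while its first call is still active


print(is_even(10))  # True
print(is_odd(10))   # False
```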


Stack Overflow and Infinite Recursion 


Of course, the amount of memory in a computer is finite, so only a certain amount of 
memory can be used to store activation records on the function-call stack. If more 
recursive function calls occur than can have their activation records stored on the stack, 
a fatal error known as stack overflow occurs. This typically is the result of infinite 
recursion, which can be caused by omitting the base case or writing the recursion step 
incorrectly so that it does not converge on the base case. This error is analogous to the 


problem of an infinite loop in an iterative (nonrecursive) solution. 
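In the standard CPython interpreter, runaway recursion usually does not exhaust the machine’s memory. The interpreter enforces a recursion limit and raises a RecursionError when a program exceeds it. The sketch below uses a hypothetical function, countdown_forever, that deliberately omits the base case:

```python
import sys


def countdown_forever(n):
    """Hypothetical function with no base case, so the recursion never converges."""
    return countdown_forever(n - 1)


print('recursion limit:', sys.getrecursionlimit())

try:
    countdown_forever(10)
except RecursionError as error:
    print('RecursionError:', error)
```

Catching the exception is shown only to make the failure visible; the real fix is to supply a base case that the recursion step converges on.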


4.17 FUNCTIONAL-STYLE PROGRAMMING 


Like other popular languages, such as Java and C#, Python is not a purely functional 
language. Rather, it offers “functional-style” features that help you write code which is 
less likely to contain errors, more concise and easier to read, debug and modify. 


Functional-style programs also can be easier to parallelize to get better performance on 


today’s multi-core processors. The chart below lists most of Python’s key functional- 


style programming capabilities and shows in parentheses the chapters in which we 


initially cover many of them. 


Functional-style programming topics


avoiding side effects (4)
closures
declarative programming (4)
decorators (10)
dictionary comprehensions (6)
filter/map/reduce (5)
functools module
generator expressions (5)
generator functions
higher-order functions (5)
immutability (4)
internal iteration (4)
iterators (3)
itertools module (16)
lambda expressions (5)
lazy evaluation (5)
list comprehensions (5)
operator module (5, 11, 16)
pure functions (4)
range function (3, 4)
reductions (3, 5)
set comprehensions (6)


We cover most of these features throughout the book, many with code examples and
others from a literacy perspective. You’ve already used list, string and built-in function
range iterators with the for statement, and several reductions (functions sum, len,
min and max). We discuss declarative programming, immutability and internal
iteration below.


What vs. How 


As the tasks you perform get more complicated, your code can become harder to read, 


debug and modify, and more likely to contain errors. Specifying how the code works 


can become complex. 


Functional-style programming lets you simply say what you want to do. It hides many 


details of how to perform each task. Typically, library code handles the how for you. As


you'll see, this can eliminate many errors. 


Consider the for statement in many other programming languages. Typically, you 
must specify all the details of counter-controlled iteration: a control variable, its initial 
value, how to increment it and a loop-continuation condition that uses the control 
variable to determine whether to continue iterating. This style of iteration is known as 
external iteration and is error-prone. For example, you might provide an incorrect 


initializer, increment or loop-continuation condition. External iteration mutates (that 





is, modifies) the control variable, and the for statement’s suite often mutates other 
variables as well. Every time you modify variables you could introduce errors. 
Functional-style programming emphasizes immutability. That is, it avoids operations 


that modify variables’ values. We'll say more in the next chapter. 


Python’s for statement and range function hide most counter-controlled iteration 
details. You specify what values range should produce and the variable that should 
receive each value as it’s produced. Function range knows how to produce those 
values. Similarly, the for statement knows how to get each value from range and how 
to stop iterating when there are no more values. Specifying what, but not how, is an 


important aspect of internal iteration—a key functional-style programming concept. 


The Python built-in functions sum, min and max each use internal iteration. To total 
the elements of the list grades, you simply declare what you want to do—that is, 
sum (grades). Function sum knows how to iterate through the list and add each 
element to the running total. Stating what you want done rather than programming 


how to do it is known as declarative programming. 
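To make the contrast concrete, the following sketch totals a list twice. The grades values are illustrative. The first version spells out how to iterate (external iteration, written here with a while statement), and the second simply states what is wanted:

```python
grades = [85, 93, 45, 89, 85]  # illustrative data

# External iteration: we manage the control variable, its initial value,
# its increment and the loop-continuation condition ourselves.
total = 0
index = 0
while index < len(grades):
    total += grades[index]
    index += 1

# Internal iteration: sum hides how the elements are visited and added.
declarative_total = sum(grades)

print(total, declarative_total)  # 397 397
```

Each of the hand-written details in the first version is a place where an off-by-one or initialization error could creep in.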


Pure Functions 


In a pure functional programming language, you focus on writing pure functions. A pure
function’s result depends only on the argument(s) you pass to it. Also, given a
particular argument (or arguments), a pure function always produces the same result.
For example, built-in function sum’s return value depends only on the iterable you pass
to it. Given a list [1, 2, 3], sum always returns 6 no matter how many times you call
it. Also, a pure function does not have side effects. For example, even if you pass a 
mutable list to a pure function, the list will contain the same values before and after the 


function call. When you call the pure function sum, it does not modify its argument. 


In [1]: values = [1, 2, 3]

In [2]: sum(values)
Out[2]: 6

In [3]: sum(values)  # same call always returns same result
Out[3]: 6

In [4]: values
Out[4]: [1, 2, 3]
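For contrast, here is a sketch of a function that is not pure. Both function names are illustrative assumptions, and the list method pop, which removes and returns a list’s last element, is used here ahead of its formal introduction:

```python
def total_pure(values):
    """Pure: the result depends only on the argument, which is left unchanged."""
    return sum(values)


def total_impure(values):
    """Impure: computes the same total but empties the caller's list as a side effect."""
    result = 0
    while values:
        result += values.pop()  # removes an element from the caller's list
    return result


data = [1, 2, 3]
print(total_pure(data), data)    # 6 [1, 2, 3]
print(total_impure(data), data)  # 6 []
print(total_impure(data), data)  # 0 []  (same call, different result)
```

The second function returns a different result the next time it receives the same list object, which is exactly the kind of surprise that pure functions avoid.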


In the next chapter, we'll continue using functional-style programming concepts. Also, 


you'll see that functions are objects that you can pass to other functions as data. 


4.18 INTRO TO DATA SCIENCE: MEASURES OF 
DISPERSION 


In our discussion of descriptive statistics, we’ve considered the measures of central 
tendency—mean, median and mode. These help us categorize typical values in a group 
—such as the mean height of your classmates or the most frequently purchased car 


brand (the mode) in a given country. 


When we're talking about a group, the entire group is called the population. 

Sometimes a population is quite large, such as the people likely to vote in the next U.S. 
presidential election, which is a number in excess of 100,000,000 people. For practical 
reasons, the polling organizations trying to predict who will become the next president 
work with carefully selected small subsets of the population known as samples. Many 


of the polls in the 2016 election had sample sizes of about 1000 people. 


In this section, we continue discussing basic descriptive statistics. We introduce 
measures of dispersion (also called measures of variability) that help you 
understand how spread out the values are. For example, in a class of students, there 
may be a bunch of students whose height is close to the average, with smaller numbers 


of students who are considerably shorter or taller. 


For our purposes, we'll calculate each measure of dispersion both by hand and with 
functions from the module statistics, using the following population of 10 six-sided 


die rolls:


1, 3, 4, 2, 6, 5, 3, 4, 5, 2


Variance 


To determine the variance,5 we begin with the mean of these values, which is 3.5. You
obtain this result by dividing the sum of the face values, 35, by the number of rolls, 10. 


Next, we subtract the mean from every die value (this produces some negative results):


-2.5, -0.5, 0.5, -1.5, 2.5, 1.5, -0.5, 0.5, 1.5, -1.5


5 For simplicity, we’re calculating the population variance. There is a subtle difference
between the population variance and the sample variance. Instead of dividing by n
(the number of die rolls in our example), sample variance divides by n - 1. The difference
is pronounced for small samples and becomes insignificant as the sample size
increases. The statistics module provides the functions pvariance and
variance to calculate the population variance and sample variance, respectively.
Similarly, the statistics module provides the functions pstdev and stdev to
calculate the population standard deviation and sample standard deviation,
respectively.


Then, we square each of these results (yielding only positives): 


6.25, 0.25, 0.25, 2.25, 6.25, 2.25, 0.25, 0.25, 2.25, 2.25


Finally, we calculate the mean of these squares, which is 2.25 (22.5 / 10)—this is 
the population variance. Squaring the difference between each die value and the 
mean of all die values emphasizes outliers—the values that are farthest from the 
mean. As we get deeper into data analytics, sometimes we'll want to pay careful 
attention to outliers, and sometimes we'll want to ignore them. The following code uses 


the statistics module’s pvariance function to confirm our manual result: 


In [1]: import statistics

In [2]: statistics.pvariance([1, 3, 4, 2, 6, 5, 3, 4, 5, 2])
Out[2]: 2.25


Standard Deviation 


The standard deviation is the square root of the variance (in this case, 1.5), which


tones down the effect of the outliers. The smaller the variance and standard deviation 


are, the closer the data values are to the mean and the less overall dispersion (that is, 
spread) there is between the values and the mean. The following code calculates the 
population standard deviation with the statistics module’s pstdev function, 


confirming our manual result: 


In [3]: statistics.pstdev([1, 3, 4, 2, 6, 5, 3, 4, 5, 2])
Out[3]: 1.5


Passing the pvariance function’s result to the math module’s sqrt function confirms 


our result of 1.5: 


In [4]: import math

In [5]: math.sqrt(statistics.pvariance([1, 3, 4, 2, 6, 5, 3, 4, 5, 2]))
Out[5]: 1.5
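The manual steps described in this section can also be sketched in code. The list rolls is our assumption for the 10 die values; it is consistent with the sum of 35, the mean of 3.5 and the squared differences given above. The sketch also calls variance to show the sample variance that the footnote mentions:

```python
import math
import statistics

rolls = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]  # assumed die rolls (sum is 35)

mean = sum(rolls) / len(rolls)  # 35 / 10

# Square each value's difference from the mean, then average the squares.
squared_differences = [(roll - mean) ** 2 for roll in rolls]
population_variance = sum(squared_differences) / len(rolls)  # 22.5 / 10
population_stdev = math.sqrt(population_variance)

print(mean, population_variance, population_stdev)  # 3.5 2.25 1.5

# The sample variance divides by n - 1 instead of n: 22.5 / 9.
print(statistics.variance(rolls))  # 2.5
```

For only 10 values, the gap between 2.25 and 2.5 is noticeable; it shrinks as the number of values grows.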


Advantage of Population Standard Deviation vs. Population Variance 


Suppose you've recorded the March Fahrenheit temperatures in your area. You might 
have 31 numbers such as 19, 32, 28 and 35. The units for these numbers are degrees. 

When you square your temperatures to calculate the population variance, the units of 
the population variance become “degrees squared.” When you take the square root of 
the population variance to calculate the population standard deviation, the units once 


again become degrees, which are the same units as your temperatures. 


4.19 WRAP-UP 


In this chapter, we created custom functions. We imported capabilities from the 
random and math modules. We introduced random-number generation and used it to 
simulate rolling a six-sided die. We packed multiple values into tuples to return more 
than one value from a function. We also unpacked a tuple to access its values. We 
discussed using the Python Standard Library’s modules to avoid “reinventing the 
wheel.” 


We created functions with default parameter values and called functions with keyword 
arguments. We also defined functions with arbitrary argument lists. We called methods 


of objects. We discussed how an identifier’s scope determines where in your program 


you can use it.


We presented more about importing modules. You saw that arguments are passed-by- 
reference to functions, and how the function-call stack and stack frames support the 
function-call-and-return mechanism. We also presented a recursive function and began 
introducing Python’s functional-style programming capabilities. We’ve introduced 
basic list and tuple capabilities over the last two chapters—in the next chapter, we'll 
discuss them in detail. 


Finally, we continued our discussion of descriptive statistics by introducing measures of 
dispersion—variance and standard deviation—and calculating them with functions 


from the Python Standard Library’s statistics module. 


For some types of problems, it’s useful to have functions call themselves. A recursive 


function calls itself, either directly or indirectly through another function. 


5. Sequences: Lists and Tuples 


Objectives 

In this chapter, you'll: 

■ Create and initialize lists and tuples.

■ Refer to elements of lists, tuples and strings.

■ Sort and search lists, and search tuples.

■ Pass lists and tuples to functions and methods.

■ Use list methods to perform common manipulations, such as searching for items, sorting a list, inserting items and removing items.

■ Use additional Python functional-style programming capabilities, including lambdas and the functional-style programming operations filter, map and reduce.

■ Use functional-style list comprehensions to create lists quickly and easily, and use generator expressions to generate values on demand.

■ Use two-dimensional lists.

■ Enhance your analysis and presentation skills with the Seaborn and Matplotlib visualization libraries.


Outline 
5.1 INTRODUCTION


In the last two chapters, we briefly introduced the list and tuple sequence types for 
representing ordered collections of items. Collections are prepackaged data structures 
consisting of related data items. Examples of collections include your favorite songs on 
your smartphone, your contacts list, a library’s books, your cards in a card game, your 
favorite sports team’s players, the stocks in an investment portfolio, patients in a cancer 


study and a shopping list. Python’s built-in collections enable you to store and access 


data conveniently and efficiently. In this chapter, we discuss lists and tuples in more 
detail. 


We'll demonstrate common list and tuple manipulations. You'll see that lists (which are 
modifiable) and tuples (which are not) have many common capabilities. Each can hold 
items of the same or different types. Lists can dynamically resize as necessary, 
growing and shrinking at execution time. We discuss one-dimensional and two- 


dimensional lists. 


In the preceding chapter, we demonstrated random-number generation and simulated 
rolling a six-sided die. We conclude this chapter with our next Intro to Data Science 
section, which uses the visualization libraries Seaborn and Matplotlib to interactively 
develop static bar charts containing the die frequencies. In the next chapter’s Intro to 
Data Science section, we'll present an animated visualization in which the bar chart 
changes dynamically as the number of die rolls increases, so you’ll see the law of large


numbers “in action.” 


5.2 LISTS 


Here, we discuss lists in more detail and explain how to refer to particular list 


elements. Many of the capabilities shown in this section apply to all sequence types. 


Creating a List 


Lists typically store homogeneous data, that is, values of the same data type. 


Consider the list c, which contains five integer elements: 


In [1]: c = [-45, 6, 0, 72, 1543]

In [2]: c
Out[2]: [-45, 6, 0, 72, 1543]


They also may store heterogeneous data, that is, data of many different types. For 


example, the following list contains a student’s first name (a string), last name (a 





string), grade point average (a float) and graduation year (an int): 


['Mary', 'Smith', 3.57, 2022]


Accessing Elements of a List 


You reference a list element by writing the list’s name followed by the element’s index 
(that is, its position number) enclosed in square brackets ([ ], known as the 
subscription operator). The following diagram shows the list c labeled with its 


element names: 


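[Figure: the list c, showing each element’s name and value]

Names of the list’s elements:    c[0]   c[1]   c[2]   c[3]   c[4]
Values of the list’s elements:    -45      6      0     72   1543

In the name c[2], the 2 is the position number of that element within the sequence.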


The first element in a list has the index 0. So, in the five-element list c, the first element 


is named c[0] and the last is c[4]:


In [3]: c[0]
Out[3]: -45

In [4]: c[4]
Out[4]: 1543


Determining a List’s Length 


To get a list’s length, use the built-in len function: 


In [5]: len(c)
Out[5]: 5


Accessing Elements from the End of the List with Negative Indices 


Lists also can be accessed from the end by using negative indices: 


Element names with positive indices:    c[0]    c[1]    c[2]    c[3]    c[4]
Element names with negative indices:    c[-5]   c[-4]   c[-3]   c[-2]   c[-1]


So, list c’s last element (c[4]) can be accessed with c[-1] and its first element with
c[-5]:


In [6]: c[-1]
Out[6]: 1543

In [7]: c[-5]
Out[7]: -45


Indices Must Be Integers or Integer Expressions 


An index must be an integer or integer expression (or a slice, as we'll soon see): 


In [8]: a = 1

In [9]: b = 2

In [10]: c[a + b]
Out[10]: 72


Using a non-integer index value causes a TypeError. 


Lists Are Mutable 


Lists are mutable—their elements can be modified: 
In [11]: c[4] = 17

In [12]: c
Out[12]: [-45, 6, 0, 72, 17]


You'll soon see that you also can insert and delete elements, changing the list’s length. 


Some Sequences Are Immutable 


Python’s string and tuple sequences are immutable—they cannot be modified. You can 
get the individual characters in a string, but attempting to assign a new value to one of 


the characters causes a TypeError: 
In [13]: s = 'hello'

In [14]: s[0]
Out[14]: 'h'

In [15]: s[0] = 'H'
-------------------------------------------------------------------------
TypeError                               Traceback (most recent call last)
<ipython-input-15-812ef2514689> in <module>()
----> 1 s[0] = 'H'

TypeError: 'str' object does not support item assignment








Attempting to Access a Nonexistent Element


Using an out-of-range list, tuple or string index causes an IndexError:


In [16]: c[100]
-------------------------------------------------------------------------
IndexError                              Traceback (most recent call last)
<ipython-input-16-9a31ea1e1a13> in <module>()
----> 1 c[100]

IndexError: list index out of range
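As a quick check of which indices are in range, the sketch below tries several positive and negative indices on a five-element list and reports the ones that fail. The try and except statements used to catch the error are an assumption here, since the text has not yet introduced exception handling:

```python
c = [-45, 6, 0, 72, 1543]

# For a list of length 5, valid indices run from -5 through 4.
for index in (0, 4, -1, -5, 5, -6, 100):
    try:
        print(f'c[{index}] = {c[index]}')
    except IndexError as error:
        print(f'c[{index}] -> IndexError: {error}')
```

The indices 5, -6 and 100 fall outside the range -len(c) through len(c) - 1, so each of them raises an IndexError.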








Using List Elements in Expressions 


List elements may be used as variables in expressions: 


In [17]: c[0] + c[1] + c[2]
Out[17]: -39


Appending to a List with += 


Let’s start with an empty list [], then use a for statement and += to append the values 





1 through 5 to the list—the list grows dynamically to accommodate each item: 


In [18]: a_list = []

In [19]: for number in range(1, 6):
    ...:     a_list += [number]
    ...:

In [20]: a_list
Out[20]: [1, 2, 3, 4, 5]


When the left operand of += is a list, the right operand must be an iterable; otherwise, a
TypeError occurs. In snippet [19]’s suite, the square brackets around number create
a one-element list, which we append to a_list. If the right operand contains multiple
elements, += appends them all. The following appends the characters of 'Python' to
the list letters:


In [21]: letters = []

In [22]: letters += 'Python'

In [23]: letters
Out[23]: ['P', 'y', 't', 'h', 'o', 'n']








If the right operand of += is a tuple, its elements also are appended to the list. Later in 


the chapter, we'll use the list method append to add items to a list. 


Concatenating Lists with + 


You can concatenate two lists, two tuples or two strings using the + operator. The 
result is a new sequence of the same type containing the left operand’s elements 


followed by the right operand’s elements. The original sequences are unchanged: 


In [24]: list1 = [10, 20, 30]

In [25]: list2 = [40, 50]

In [26]: concatenated_list = list1 + list2

In [27]: concatenated_list
Out[27]: [10, 20, 30, 40, 50]





A TypeError occurs if the + operator’s operands are different sequence types. For


example, concatenating a list and a tuple is an error. 
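A short sketch of that error and one possible workaround follows. Converting the tuple with the built-in function list before concatenating is our suggestion for illustration, not a step from the text:

```python
list1 = [10, 20, 30]
tuple1 = (40, 50)

try:
    combined = list1 + tuple1  # mixed sequence types
except TypeError as error:
    print('TypeError:', error)

combined = list1 + list(tuple1)  # convert first, then concatenate
print(combined)  # [10, 20, 30, 40, 50]
print(list1)     # [10, 20, 30]; the original list is unchanged
```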


Using for and range to Access List Indices and Values 


List elements also can be accessed via their indices and the subscription operator ([ ] ): 


In [28]: for i in range(len(concatenated_list)):
    ...:     print(f'{i}: {concatenated_list[i]}')
    ...:
0: 10
1: 20
2: 30
3: 40
4: 50


The function call range(len(concatenated_list)) produces a sequence of
integers representing concatenated_list’s indices (in this case, 0 through 4). When
looping in this manner, you must ensure that indices remain in range. Soon, we’ll show
a safer way to access element indices and values using built-in function enumerate.


Comparison Operators 


You can compare entire lists element-by-element using comparison operators: 


In [29]: a = [1, 2, 3]

In [30]: b = [1, 2, 3]

In [31]: c = [1, 2, 3, 4]

In [32]: a == b  # True: corresponding elements in both are equal
Out[32]: True

In [33]: a == c  # False: a and c have different elements and lengths
Out[33]: False

In [34]: a < c  # True: a has fewer elements than c
Out[34]: True

In [35]: c >= b  # True: elements 0-2 are equal but c has more elements
Out[35]: True
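List comparisons with < and > proceed element by element from the left, and the first pair of unequal elements decides the result. Length matters only when one list is a prefix of the other. The values in this sketch are illustrative:

```python
print([1, 2, 3] < [1, 2, 4])     # True: the first difference is 3 < 4
print([1, 2, 3] < [1, 2, 3, 0])  # True: equal prefix, so the shorter list is smaller
print([2] > [1, 99, 99])         # True: decided by the first elements, 2 > 1
print([1, 2, 3] == [3, 2, 1])    # False: order matters
```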
5.3 TUPLES 


As discussed in the preceding chapter, tuples are immutable and typically store 


heterogeneous data, but the data can be homogeneous. A tuple’s length is its number of 


elements and cannot change during program execution. 


Creating Tuples 


To create an empty tuple, use empty parentheses:


In [1]: student_tuple = ()

In [2]: student_tuple
Out[2]: ()

In [3]: len(student_tuple)
Out[3]: 0




Recall that you can pack a tuple by separating its values with commas: 


In [4]: student_tuple = 'John', 'Green', 3.3

In [5]: student_tuple
Out[5]: ('John', 'Green', 3.3)

In [6]: len(student_tuple)
Out[6]: 3





When you output a tuple, Python always displays its contents in parentheses. You may 


surround a tuple’s comma-separated list of values with optional parentheses: 


In [7]: another_student_tuple = ('Mary', 'Red', 3.3)

In [8]: another_student_tuple
Out[8]: ('Mary', 'Red', 3.3)


The following code creates a one-element tuple: 


In [9]: a_singleton_tuple = ('red',)  # note the comma

In [10]: a_singleton_tuple
Out[10]: ('red',)


The comma (,) that follows the string 'red' identifies a_singleton_tuple as a
tuple; the parentheses are optional. If the comma were omitted, the parentheses would
be redundant, and a_singleton_tuple would simply refer to the string 'red'
rather than a tuple.


Accessing Tuple Elements 


A tuple’s elements, though related, are often of multiple types. Usually, you do not 
iterate over them. Rather, you access each individually. Like list indices, tuple indices 
start at 0. The following code creates time_tuple representing an hour, minute and
second, displays the tuple, then uses its elements to calculate the number of seconds 
since midnight—note that we perform a different operation with each value in the 


tuple: 
In [11]: time_tuple = (9, 16, 1)

In [12]: time_tuple
Out[12]: (9, 16, 1)

In [13]: time_tuple[0] * 3600 + time_tuple[1] * 60 + time_tuple[2]
Out[13]: 33361





Assigning a value to a tuple element causes a TypeError. 


Adding Items to a String or Tuple 


As with lists, the += augmented assignment statement can be used with strings and 
tuples, even though they're immutable. In the following code, after the two 


assignments, tuple1 and tuple2 refer to the same tuple object:


In [14]: tuple1 = (10, 20, 30)

In [15]: tuple2 = tuple1

In [16]: tuple2
Out[16]: (10, 20, 30)








Concatenating the tuple (40, 50) to tuple1 creates a new tuple, then assigns a
reference to it to the variable tuple1; tuple2 still refers to the original tuple:


In [17]: tuple1 += (40, 50)

In [18]: tuple1
Out[18]: (10, 20, 30, 40, 50)

In [19]: tuple2
Out[19]: (10, 20, 30)


For a string or tuple, the item to the right of += must be a string or tuple, respectively— 


mixing types causes a TypeError. 
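The is operator from the previous chapter offers one way to confirm this behavior. In the sketch below, which reuses the names tuple1 and tuple2 and adds the illustrative names list1 and list2, += rebinds the tuple variable to a new object, whereas the same operator on a list modifies the existing object in place:

```python
tuple1 = (10, 20, 30)
tuple2 = tuple1
print(tuple1 is tuple2)  # True: both names refer to one tuple object

tuple1 += (40, 50)       # creates a new tuple and rebinds tuple1
print(tuple1 is tuple2)  # False
print(tuple2)            # (10, 20, 30)

list1 = [10, 20, 30]
list2 = list1
list1 += [40, 50]        # extends the existing list object in place
print(list1 is list2)    # True
print(list2)             # [10, 20, 30, 40, 50]
```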


Appending Tuples to Lists 


You can use += to append a tuple to a list: 


In [20]: numbers = [1, 2, 3, 4, 5]

In [21]: numbers += (6, 7)

In [22]: numbers
Out[22]: [1, 2, 3, 4, 5, 6, 7]





Tuples May Contain Mutable Objects 


Let’s create a student tuple with a first name, last name and list of grades: 


In [23]: student_tuple = ('Amanda', 'Blue', [98, 75, 87])


Even though the tuple is immutable, its list element is mutable: 


In [24]: student_tuple[2][1] = 85

In [25]: student_tuple
Out[25]: ('Amanda', 'Blue', [98, 85, 87])





In the double-subscripted name student_tuple[2][1], Python views
student_tuple[2] as the element of the tuple containing the list [98, 75, 87],
then uses [1] to access the list element containing 75. The assignment in snippet [24]
replaces that grade with 85.
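The distinction between modifying the list inside the tuple and replacing a tuple element can be sketched as follows. The replacement list [100, 100, 100] is an illustrative value:

```python
student_tuple = ('Amanda', 'Blue', [98, 75, 87])

student_tuple[2][1] = 85  # allowed: modifies the list object the tuple refers to

try:
    student_tuple[2] = [100, 100, 100]  # not allowed: assigns to a tuple element
except TypeError as error:
    print('TypeError:', error)

print(student_tuple)  # ('Amanda', 'Blue', [98, 85, 87])
```

The tuple always refers to the same three objects; immutability does not extend to the contents of those objects.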


5.4 UNPACKING SEQUENCES 


The previous chapter introduced tuple unpacking. You can unpack any sequence’s 


elements by assigning the sequence to a comma-separated list of variables. A 


ValueError occurs if the number of variables to the left of the assignment symbol is 


not identical to the number of elements in the sequence on the right: 


In [1]: student_tuple = ('Amanda', [98, 85, 87])

In [2]: first_name, grades = student_tuple

In [3]: first_name
Out[3]: 'Amanda'

In [4]: grades
Out[4]: [98, 85, 87]





The following code unpacks a string, a list and a sequence produced by range: 


In [5]: first, second = 'hi'

In [6]: print(f'{first} {second}')
h i

In [7]: number1, number2, number3 = [2, 3, 5]

In [8]: print(f'{number1} {number2} {number3}')
2 3 5

In [9]: number1, number2, number3 = range(10, 40, 10)

In [10]: print(f'{number1} {number2} {number3}')
10 20 30
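When the number of variables does not match the number of elements, unpacking fails with a ValueError in both directions, as this sketch with illustrative variable names suggests:

```python
try:
    first, second = [1, 2, 3]  # more elements than variables
except ValueError as error:
    print('ValueError:', error)

try:
    first, second, third = 'hi'  # fewer elements than variables
except ValueError as error:
    print('ValueError:', error)

first, second, third = 'hey'  # counts match, so unpacking succeeds
print(first, second, third)   # h e y
```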


Swapping Values Via Packing and Unpacking 


You can swap two variables’ values using sequence packing and unpacking: 


In [11]: number1 = 99

In [12]: number2 = 22

In [13]: number1, number2 = (number2, number1)

In [14]: print(f'number1 = {number1}; number2 = {number2}')
number1 = 22; number2 = 99


Accessing Indices and Values Safely with Built-in Function enumerate 


Earlier, we called range to produce a sequence of index values, then accessed list 
elements in a for loop using the index values and the subscription operator ([]). This
is error-prone because you could pass the wrong arguments to range. If any value 
produced by range is an out-of-bounds index, using it as an index causes an 


IndexError. 


The preferred mechanism for accessing an element’s index and value is the built-in 
function enumerate. This function receives an iterable and creates an iterator that, for 
each element, returns a tuple containing the element’s index and value. The following 


code uses the built-in function list to create a list containing enumerate’s results: 


In [15]: colors = ['red', 'orange', 'yellow']

In [16]: list(enumerate(colors))
Out[16]: [(0, 'red'), (1, 'orange'), (2, 'yellow')]


Similarly the built-in function tuple creates a tuple from a sequence: 


In [17]: tuple(enumerate(colors))
Out[17]: ((0, 'red'), (1, 'orange'), (2, 'yellow'))


The following for loop unpacks each tuple returned by enumerate into the variables 


index and value and displays them: 


In [18]: for index, value in enumerate(colors):
    ...:     print(f'{index}: {value}')
    ...:
0: red
1: orange
2: yellow
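
As a small extension that the text does not cover, enumerate also accepts an optional start argument that sets the first counter value. The sketch below assumes you want one-based numbering for display; the counter no longer matches the list indices in that case, so it should not be used for subscripting.

```python
colors = ['red', 'orange', 'yellow']

numbered = []
# start=1 changes only the counter; every element is still visited.
for position, color in enumerate(colors, start=1):
    numbered.append(f'{position}: {color}')

print(numbered)  # ['1: red', '2: orange', '3: yellow']
```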


Creating a Primitive Bar Chart 


The following script creates a primitive bar chart where each bar’s length is made of 
asterisks (*) and is proportional to the list’s corresponding element value. We use the 
function enumerate to access the list’s indices and values safely. To run this example, 
change to this chapter’s ch05 examples folder, then enter: 

python fig05_01.py

or, if you’re in IPython already, use the command:

run fig05_01.py


# fig05_01.py
"""Displaying a bar chart"""
numbers = [19, 3, 15, 7, 11]

print('\nCreating a bar chart from numbers:')
print(f'Index{"Value":>8}   Bar')

for index, value in enumerate(numbers):
    print(f'{index:>5}{value:>8}   {"*" * value}')


Creating a bar chart from numbers:
Index   Value   Bar
    0      19   *******************
    1       3   ***
    2      15   ***************
    3       7   *******
    4      11   ***********








The for statement uses enumerate to get each element’s index and value, then
displays a formatted line containing the index, the element value and the corresponding
bar of asterisks. The expression

"*" * value

creates a string consisting of value asterisks. When used with a sequence, the
multiplication operator (*) repeats the sequence (in this case, the string "*") value
times. Later in this chapter, we’ll use the open-source Seaborn and Matplotlib libraries
to display a publication-quality bar chart visualization.


5.5 SEQUENCE SLICING 


You can slice sequences to create new sequences of the same type containing subsets of 
the original elements. Slice operations can modify mutable sequences—those that do 


not modify a sequence work identically for lists, tuples and strings. 


Specifying a Slice with Starting and Ending Indices 


Let’s create a slice consisting of the elements at indices 2 through 5 of a list: 


In [1]: numbers = [2, 3, 5, 7, 11, 13, 17, 19]

In [2]: numbers[2:6]
Out[2]: [5, 7, 11, 13]


The slice copies elements from the starting index to the left of the colon (2) up to, but 
not including, the ending index to the right of the colon (6). The original list is not 


modified. 


Specifying a Slice with Only an Ending Index 


If you omit the starting index, 0 is assumed. So, the slice numbers[:6] is equivalent to
the slice numbers[0:6]:


In [3]: numbers[:6]
Out[3]: [2, 3, 5, 7, 11, 13]

In [4]: numbers[0:6]
Out[4]: [2, 3, 5, 7, 11, 13]


Specifying a Slice with Only a Starting Index 


If you omit the ending index, Python assumes the sequence’s length (8 here), so snippet 


[5]’s slice contains the elements of numbers at indices 6 and 7: 


In [5]: numbers[6:]
Out[5]: [17, 19]

In [6]: numbers[6:len(numbers)]
Out[6]: [17, 19]


Specifying a Slice with No Indices 


Omitting both the start and end indices copies the entire sequence: 


In [7]: numbers[:]
Out[7]: [2, 3, 5, 7, 11, 13, 17, 19]


Though slices create new objects, slices make shallow copies of the elements—that is, 
they copy the elements’ references but not the objects they point to. So, in the snippet 
above, the new list’s elements refer to the same objects as the original list’s elements, 
rather than to separate copies. In the “Array-Oriented Programming with NumPy” 
chapter, we'll explain deep copying, which actually copies the referenced objects 


themselves, and we'll point out when deep copying is preferred. 
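
To make the shallow-copy behavior concrete, here is a minimal sketch using a hypothetical nested list named grades (not from the text). The slice creates a new outer list, but both lists still refer to the same inner list objects:

```python
grades = [[98, 75], [88, 92]]
copy_of_grades = grades[:]         # new outer list, same inner list objects

copy_of_grades.append([70, 80])    # changes only the copy's outer list
copy_of_grades[0][1] = 85          # changes an inner list shared by both

print(grades)                          # [[98, 85], [88, 92]]
print(copy_of_grades)                  # [[98, 85], [88, 92], [70, 80]]
print(grades[0] is copy_of_grades[0])  # True
```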


Slicing with Steps 


The following code uses a step of 2 to create a slice with every other element of 


numbers: 


In [8]: numbers[::2]
Out[8]: [2, 5, 11, 17]


We omitted the start and end indices, so 0 and len(numbers) are assumed,
respectively.


Slicing with Negative Indices and Steps 


You can use a negative step to select slices in reverse order. The following code 
concisely creates a new list in reverse order: 


In [9]: numbers[::-1]
Out[9]: [19, 17, 13, 11, 7, 5, 3, 2]


This is equivalent to: 


In [10]: numbers[-1:-9:-1]
Out[10]: [19, 17, 13, 11, 7, 5, 3, 2]


Modifying Lists Via Slices 


You can modify a list by assigning to a slice of it—the rest of the list is unchanged. The 


following code replaces numbers’ first three elements, leaving the rest unchanged: 


In [11]: numbers[0:3] = ['two', 'three', 'five']

In [12]: numbers
Out[12]: ['two', 'three', 'five', 7, 11, 13, 17, 19]


The following deletes only the first three elements of numbers by assigning an empty 
list to the three-element slice: 


In [13]: numbers[0:3] = []

In [14]: numbers
Out[14]: [7, 11, 13, 17, 19]


The following assigns a list’s elements to a slice of every other element of numbers: 


In [15]: numbers = [2, 3, 5, 7, 11, 13, 17, 19]

In [16]: numbers[::2] = [100, 100, 100, 100]

In [17]: numbers
Out[17]: [100, 3, 100, 7, 100, 13, 100, 19]

In [18]: id(numbers)
Out[18]: 4434456648


Let’s delete all the elements in numbers, leaving the existing list empty: 


In [19]: numbers[:] = []

In [20]: numbers
Out[20]: []

In [21]: id(numbers)
Out[21]: 4434456648


Deleting numbers’ contents (snippet [19]) is different from assigning numbers a new
empty list [] (snippet [22]). To prove this, we display numbers’ identity after each
operation. The identities are different, so they represent separate objects in memory:





In [22]: numbers = []

In [23]: numbers
Out[23]: []

In [24]: id(numbers)
Out[24]: 4406030920


When you assign a new object to a variable (as in snippet [22]), the original object will
be garbage collected if no other variables refer to it.
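
The difference between emptying a list in place and rebinding the variable becomes visible when a second variable refers to the same list. The following sketch uses a hypothetical second name, alias, to show this:

```python
numbers = [2, 3, 5]
alias = numbers            # a second name for the same list object

numbers[:] = []            # empties the existing object in place
emptied_in_place = (alias == [] and alias is numbers)

numbers = [7, 11]          # binds numbers to a brand-new list object
print(emptied_in_place)    # True
print(alias)               # []
print(alias is numbers)    # False
```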


5.6 DEL STATEMENT 


The del statement also can be used to remove elements from a list and to delete 
variables from the interactive session. You can remove the element at any valid index or 


the element(s) from any valid slice. 


Deleting the Element at a Specific List Index 


Let’s create a list, then use del to remove its last element:


In [1]: numbers = list(range(0, 10))

In [2]: numbers
Out[2]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [3]: del numbers[-1]

In [4]: numbers
Out[4]: [0, 1, 2, 3, 4, 5, 6, 7, 8]





Deleting a Slice from a List 


The following deletes the list’s first two elements: 
In [5]: del numbers[0:2] 
In [6]: numbers 


Out[6]: [2, 3, 4, 5, 6, 7, 8]


The following uses a step in the slice to delete every other element from the entire list: 


In [7]: del numbers[::2] 
In [8]: numbers 
Out[8]: [3, 5, 7]


Deleting a Slice Representing the Entire List 


The following code deletes all of the list’s elements: 


In [9]: del numbers[:] 
In [10]: numbers 
Out[10]: []


Deleting a Variable from the Current Session 


The del statement can delete any variable. Let’s delete numbers from the interactive 


session, then attempt to display the variable’s value, causing a NameError: 


In [11]: del numbers

In [12]: numbers
-------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-12-426f8401232b> in <module>()
----> 1 numbers

NameError: name 'numbers' is not defined














5.7 PASSING LISTS TO FUNCTIONS 


In the last chapter, we mentioned that all objects are passed by reference and 
demonstrated passing an immutable object as a function argument. Here, we discuss 
references further by examining what happens when a program passes a mutable list 


object to a function. 


Passing an Entire List to a Function 





Consider the function modify_elements, which receives a reference to a list and
multiplies each of the list’s element values by 2:


In [1]: def modify_elements(items):
   ...:     """Multiplies all element values in items by 2."""
   ...:     for i in range(len(items)):
   ...:         items[i] *= 2
   ...:

In [2]: numbers = [10, 3, 7, 1, 9]

In [3]: modify_elements(numbers)

In [4]: numbers
Out[4]: [20, 6, 14, 2, 18]








Function modify_elements’ items parameter receives a reference to the original
list, so the statement in the loop’s suite modifies each element in the original list object.
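
If the caller's list should stay unchanged, one option is to build and return a new list instead. The sketch below contrasts modify_elements with a hypothetical helper, doubled_copy, which is not part of the text:

```python
def modify_elements(items):
    """Multiplies all element values in items by 2, in place."""
    for i in range(len(items)):
        items[i] *= 2

def doubled_copy(items):
    """Hypothetical helper: returns a new list and leaves items unchanged."""
    result = []
    for item in items:
        result.append(item * 2)
    return result

numbers = [10, 3, 7, 1, 9]
new_numbers = doubled_copy(numbers)
print(numbers)      # [10, 3, 7, 1, 9]
print(new_numbers)  # [20, 6, 14, 2, 18]

modify_elements(numbers)
print(numbers)      # [20, 6, 14, 2, 18]
```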


Passing a Tuple to a Function 


When you pass a tuple to a function, attempting to modify the tuple’s immutable 


elements results in a TypeError: 


In [5]: numbers_tuple = (10, 20, 30)

In [6]: numbers_tuple
Out[6]: (10, 20, 30)

In [7]: modify_elements(numbers_tuple)
-------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-9339741cd595> in <module>()
----> 1 modify_elements(numbers_tuple)

<ipython-input-1-27acb8f8f44c> in modify_elements(items)
      2     """Multiplies all element values in items by 2."""
      3     for i in range(len(items)):
----> 4         items[i] *= 2
      5
      6

TypeError: 'tuple' object does not support item assignment








Recall that tuples may contain mutable objects, such as lists. Those objects still can be 


modified when a tuple is passed to a function. 
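
As a brief illustration of the point above, the following sketch passes a tuple containing a list to a hypothetical function, add_grade, which modifies that list through the tuple:

```python
def add_grade(student, grade):
    """Hypothetical helper: appends grade to the list stored at student[2]."""
    student[2].append(grade)

student_tuple = ('Amanda', 'Blue', [98, 85, 87])
add_grade(student_tuple, 100)

print(student_tuple)  # ('Amanda', 'Blue', [98, 85, 87, 100])
```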


A Note Regarding Tracebacks 


The previous traceback shows the two snippets that led to the TypeError. The first is 
snippet [7]’s function call. The second is snippet [1]’s function definition. Line 
numbers precede each snippet’s code. We’ve demonstrated mostly single-line snippets. 
When an exception occurs in such a snippet, it’s always preceded by ----> 1, 


indicating that line 1 (the snippet’s only line) caused the exception. Multiline snippets
like the definition of modify_elements show consecutive line numbers starting at 1.
The notation ----> 4 above indicates that the exception occurred in line 4 of
modify_elements. No matter how long the traceback is, the last line of code with
----> caused the exception.


5.8 SORTING LISTS 


Sorting enables you to arrange data either in ascending or descending order. 


Sorting a List in Ascending Order 


List method sort modifies a list to arrange its elements in ascending order: 


In [1]: numbers = [10, 3, 7, 1, 9, 4, 2, 8, 5, 6]

In [2]: numbers.sort()

In [3]: numbers
Out[3]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]





Sorting a List in Descending Order 


To sort a list in descending order, call list method sort with the optional keyword
argument reverse set to True (False is the default):


In [4]: numbers.sort(reverse=True)

In [5]: numbers
Out[5]: [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]


Built-In Function sorted 


Built-in function sorted returns a new list containing the sorted elements of its 
argument sequence—the original sequence is unmodified. The following code 


demonstrates function sorted for a list, a string and a tuple: 


In [6]: numbers = [10, 3, 7, 1, 9, 4, 2, 8, 5, 6]

In [7]: ascending_numbers = sorted(numbers)

In [8]: ascending_numbers
Out[8]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [9]: numbers
Out[9]: [10, 3, 7, 1, 9, 4, 2, 8, 5, 6]

In [10]: letters = 'fadgchjebi'

In [11]: ascending_letters = sorted(letters)

In [12]: ascending_letters
Out[12]: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

In [13]: letters
Out[13]: 'fadgchjebi'

In [14]: colors = ('red', 'orange', 'yellow', 'green', 'blue')

In [15]: ascending_colors = sorted(colors)

In [16]: ascending_colors
Out[16]: ['blue', 'green', 'orange', 'red', 'yellow']

In [17]: colors
Out[17]: ('red', 'orange', 'yellow', 'green', 'blue')


Use the optional keyword argument reverse with the value True to sort the elements 


in descending order. 
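
For example, a short sketch of sorted with reverse=True, reusing the values from the snippets above (the variable names descending_numbers and descending_colors are illustrative):

```python
numbers = [10, 3, 7, 1, 9, 4, 2, 8, 5, 6]
colors = ('red', 'orange', 'yellow', 'green', 'blue')

descending_numbers = sorted(numbers, reverse=True)
descending_colors = sorted(colors, reverse=True)

print(descending_numbers)  # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
print(descending_colors)   # ['yellow', 'red', 'orange', 'green', 'blue']
print(numbers)             # original list is unchanged
```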


5.9 SEARCHING SEQUENCES 


Often, you'll want to determine whether a sequence (such as a list, tuple or string) 
contains a value that matches a particular key value. Searching is the process of 


locating a key. 


List Method index 


List method index takes as an argument a search key—the value to locate in the list— 
then searches through the list from index 0 and returns the index of the first element


that matches the search key: 


In [1]: numbers = [3, 7, 1, 4, 2, 8, 5, 6]

In [2]: numbers.index(5)
Out[2]: 6


A ValueError occurs if the value you're searching for is not in the list. 


Specifying the Starting Index of a Search 


Using method index’s optional arguments, you can search a subset of a list’s elements. 


You can use *= to multiply a sequence—that is, append a sequence to itself multiple 


times. After the following snippet, numbers contains two copies of the original list’s 


contents: 


In [3]: numbers *= 2

In [4]: numbers
Out[4]: [3, 7, 1, 4, 2, 8, 5, 6, 3, 7, 1, 4, 2, 8, 5, 6]


The following code searches the updated list for the value 5 starting from index 7 and 


continuing through the end of the list: 


In [5]: numbers.index(5, 7) 
Out[5]: 14


Specifying the Starting and Ending Indices of a Search 


Specifying the starting and ending indices causes index to search from the starting 
index up to but not including the ending index location. The call to index in snippet 
[5]: 


numbers.index(5, 7) 


assumes the length of numbers as its optional third argument and is equivalent to: 


numbers.index(5, 7, len(numbers))


The following looks for the value 7 in the range of elements with indices 0 through 3: 


In [6]: numbers.index(7, 0, 4)
Out[6]: 1


Operators in and not in


Operator in tests whether its right operand’s iterable contains the left operand’s value: 


In [7]: 1000 in numbers 
Out[7]: False 


In [8]: 5 in numbers
Out[8]: True


Similarly, operator not in tests whether its right operand’s iterable does not contain 


the left operand’s value: 
In [9]: 1000 not in numbers 
Out[9]: True

In [10]: 5 not in numbers
Out[10]: False


Using Operator in to Prevent a ValueError
You can use the operator in to ensure that calls to method index do not result in 


ValueErrors for search keys that are not in the corresponding sequence: 


In [11]: key = 1000

In [12]: if key in numbers:
    ...:     print(f'found {key} at index {numbers.index(key)}')
    ...: else:
    ...:     print(f'{key} not found')
    ...:
1000 not found
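
An alternative, not shown in the text, is to call index and handle the ValueError if it occurs, which searches the list only once. The sketch below assumes a hypothetical helper named find_index that returns -1 for a missing key:

```python
numbers = [3, 7, 1, 4, 2, 8, 5, 6]

def find_index(sequence, key):
    """Hypothetical helper: returns key's index in sequence, or -1 if absent."""
    try:
        return sequence.index(key)   # a single search
    except ValueError:
        return -1

print(find_index(numbers, 5))     # 6
print(find_index(numbers, 1000))  # -1
```

Checking with in first is often easier to read; the try/except form avoids the second pass over the list.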














Built-In Functions any and all


Sometimes you simply need to know whether any item in an iterable is True or
whether all the items are True. The built-in function any returns True if any item in
its iterable argument is True. The built-in function all returns True if all items in its
iterable argument are True. Recall that nonzero values are True and 0 is False.
Non-empty iterable objects also evaluate to True, whereas any empty iterable evaluates to
False. Functions any and all are additional examples of internal iteration in
functional-style programming.
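
A minimal sketch of any and all with illustrative lists (the variable names are hypothetical). Note that, for an empty iterable, any returns False and all returns True:

```python
mixed = [0, 3, 0]        # one nonzero (True) value
nonzero = [10, 3, 7]     # every value is nonzero

print(any(mixed))    # True: at least one item is True
print(all(mixed))    # False: the zeros are False
print(all(nonzero))  # True
print(any([]))       # False: there is no True item
print(all([]))       # True: there is no False item
```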


5.10 OTHER LIST METHODS 


Lists also have methods that add and remove elements. Consider the list
color_names:


In [1]: color_names = ['orange', 'yellow', 'green']


Inserting an Element at a Specific List Index 


Method insert adds a new item at a specified index. The following inserts 'red' at 


index 0: 


In [2]: color_names.insert(0, 'red')

In [3]: color_names
Out[3]: ['red', 'orange', 'yellow', 'green']


Adding an Element to the End of a List 


You can add a new item to the end of a list with method append: 


In [4]: color_names.append('blue')

In [5]: color_names
Out[5]: ['red', 'orange', 'yellow', 'green', 'blue']


Adding All the Elements of a Sequence to the End of a List 


Use list method extend to add all the elements of another sequence to the end of a list: 


In [6]: color_names.extend(['indigo', 'violet'])

In [7]: color_names
Out[7]: ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']





This is the equivalent of using +=. The following code adds all the characters of a string 


then all the elements of a tuple to a list: 


In [8]: sample_list = []

In [9]: s = 'abc'

In [10]: sample_list.extend(s)

In [11]: sample_list
Out[11]: ['a', 'b', 'c']

In [12]: t = (1, 2, 3)

In [13]: sample_list.extend(t)

In [14]: sample_list
Out[14]: ['a', 'b', 'c', 1, 2, 3]


Rather than creating a temporary variable, like t, to store a tuple before appending it to 
a list, you might want to pass a tuple directly to extend. In this case, the tuple’s 


parentheses are required, because extend expects one iterable argument: 
In [15]: sample_list.extend((4, 5, 6))  # note the extra parentheses

In [16]: sample_list
Out[16]: ['a', 'b', 'c', 1, 2, 3, 4, 5, 6]


A TypeError occurs if you omit the required parentheses. 
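
The following sketch reproduces that TypeError and then makes the corrected call. The exact wording of the error message varies between Python versions, so it is printed rather than assumed:

```python
sample_list = ['a', 'b', 'c']

try:
    sample_list.extend(4, 5, 6)   # three arguments instead of one iterable
    raised_type_error = False
except TypeError as error:
    raised_type_error = True
    print(f'TypeError: {error}')

sample_list.extend((4, 5, 6))     # one tuple argument
print(sample_list)                # ['a', 'b', 'c', 4, 5, 6]
```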


Removing the First Occurrence of an Element in a List 


Method remove deletes the first element with a specified value—a ValueError occurs 


if remove’s argument is not in the list: 


In [17]: color_names.remove('green')

In [18]: color_names
Out[18]: ['red', 'orange', 'yellow', 'blue', 'indigo', 'violet']


Emptying a List 


To delete all the elements in a list, call method clear: 


In [19]: color_names.clear()

In [20]: color_names
Out[20]: []


This is the equivalent of the previously shown slice assignment 


color_names[:] = [] 


Counting the Number of Occurrences of an Item 


List method count searches for its argument and returns the number of times it is 


found: 


In [21]: responses = [1, 2, 5, 4, 3, 5, 2, 1, 3, 3,
    ...:              1, 4, 3, 3, 3, 2, 3, 3, 2, 2]
    ...:

In [22]: for i in range(1, 6):
    ...:     print(f'{i} appears {responses.count(i)} times in responses')
    ...:
1 appears 3 times in responses
2 appears 5 times in responses
3 appears 8 times in responses
4 appears 2 times in responses
5 appears 2 times in responses





Reversing a List’s Elements 


List method reverse reverses the contents of a list in place, rather than creating a 


reversed copy, as we did with a slice previously: 


In [23]: color_names = ['red', 'orange', 'yellow', 'green', 'blue']

In [24]: color_names.reverse()

In [25]: color_names
Out[25]: ['blue', 'green', 'yellow', 'orange', 'red']


Copying a List 


List method copy returns a new list containing a shallow copy of the original list: 


In [26]: copied_list = color_names.copy()

In [27]: copied_list
Out[27]: ['blue', 'green', 'yellow', 'orange', 'red']


This is equivalent to the previously demonstrated slice operation: 


copied_list = color_names[:]


5.11 SIMULATING STACKS WITH LISTS 


The preceding chapter introduced the function-call stack. Python does not have a
built-in stack type, but you can think of a stack as a constrained list. You push using list
method append, which adds a new element to the end of the list. You pop using list
method pop with no arguments, which removes and returns the item at the end of the
list.


Let’s create an empty list called stack, push (append) two strings onto it, then pop 


the strings to confirm they’re retrieved in last-in, first-out (LIFO) order: 


In [1]: stack = []

In [2]: stack.append('red')

In [3]: stack
Out[3]: ['red']

In [4]: stack.append('green')

In [5]: stack
Out[5]: ['red', 'green']

In [6]: stack.pop()
Out[6]: 'green'

In [7]: stack
Out[7]: ['red']

In [8]: stack.pop()
Out[8]: 'red'

In [9]: stack
Out[9]: []

In [10]: stack.pop()
-------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-10-50ea7ec13fbe> in <module>()
----> 1 stack.pop()

IndexError: pop from empty list




















For each pop snippet, the value that pop removes and returns is displayed. Popping
from an empty stack causes an IndexError, just like accessing a nonexistent list
element with []. To prevent an IndexError, ensure that len(stack) is greater than
0 before calling pop. You can run out of memory if you keep pushing items faster than
you pop them.
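
The length check described above can be sketched as a loop that pops only while items remain. The list named popped is a hypothetical variable used to record the order of removal:

```python
stack = []
stack.append('red')
stack.append('green')

popped = []
while len(stack) > 0:            # guard: pop only while items remain
    popped.append(stack.pop())

print(popped)  # ['green', 'red'] (last-in, first-out)
print(stack)   # []
```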


You also can use a list to simulate another popular collection called a queue in which 
you insert at the back and delete from the front. Items are retrieved from queues in 
first-in, first-out (FIFO) order. 
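
A minimal sketch of a queue follows. The first half uses only list methods; the second half uses collections.deque from the standard library, which the text does not discuss but which removes items from the front without shifting the remaining elements:

```python
from collections import deque

# A plain list works as a queue, but pop(0) shifts every remaining item.
queue = []
queue.append('red')
queue.append('green')
first_out = queue.pop(0)

# deque supports efficient insertion and removal at both ends.
fast_queue = deque()
fast_queue.append('red')
fast_queue.append('green')
first_out_deque = fast_queue.popleft()

print(first_out, first_out_deque)  # red red (first-in, first-out)
print(queue, list(fast_queue))     # ['green'] ['green']
```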


5.12 LIST COMPREHENSIONS 


Here, we continue discussing functional-style features with list comprehensions—a 
concise and convenient notation for creating new lists. List comprehensions can replace 


many for statements that iterate over existing sequences and create new lists, such as: 


In [1]: list1 = []

In [2]: for item in range(1, 6):
   ...:     list1.append(item)
   ...:

In [3]: list1
Out[3]: [1, 2, 3, 4, 5]


Using a List Comprehension to Create a List of Integers 


We can accomplish the same task in a single line of code with a list comprehension: 


In [4]: list2 = [item for item in range(1, 6)]

In [5]: list2
Out[5]: [1, 2, 3, 4, 5]

Like snippet [2]’s for statement, the list comprehension’s for clause 


for item in range(1, 6) 


iterates over the sequence produced by range(1, 6). For each item, the list
comprehension evaluates the expression to the left of the for clause and places the
expression’s value (in this case, the item itself) in the new list. Snippet [4]’s particular
comprehension could have been expressed more concisely using the function list:


list2 = list(range(1, 6))


Mapping: Performing Operations in a List Comprehension’s Expression 


A list comprehension’s expression can perform tasks, such as calculations, that map 
elements to new values (possibly of different types). Mapping is a common functional- 
style programming operation that produces a result with the same number of elements 
as the original data being mapped. The following comprehension maps each value to its 


cube with the expression item ** 3: 

In [6]: list3 = [item ** 3 for item in range(1, 6)]

In [7]: list3
Out[7]: [1, 8, 27, 64, 125]


Filtering: List Comprehensions with if Clauses


Another common functional-style programming operation is filtering elements to 
select only those that match a condition. This typically produces a list with fewer 


elements than the data being filtered. To do this in a list comprehension, use the if 





clause. The following includes in list4 only the even values produced by the for


clause: 

In [8]: list4 = [item for item in range(1, 11) if item % 2 == 0]

In [9]: list4
Out[9]: [2, 4, 6, 8, 10]


List Comprehension That Processes Another List’s Elements 





The for clause can process any iterable. Let’s create a list of lowercase strings and use a 


list comprehension to create a new list containing their uppercase versions: 


In [10]: colors = ['red', 'orange', 'yellow', 'green', 'blue']

In [11]: colors2 = [item.upper() for item in colors]

In [12]: colors2
Out[12]: ['RED', 'ORANGE', 'YELLOW', 'GREEN', 'BLUE']

In [13]: colors
Out[13]: ['red', 'orange', 'yellow', 'green', 'blue']


5.13 GENERATOR EXPRESSIONS 


A generator expression is similar to a list comprehension, but creates an iterable 
generator object that produces values on demand. This is known as lazy 
evaluation. List comprehensions use greedy evaluation—they create lists 
immediately when you execute them. For large numbers of items, creating a list can 
take substantial memory and time. So generator expressions can reduce your program’s 


memory consumption and improve performance if the whole list is not needed at once. 


Generator expressions have the same capabilities as list comprehensions, but you 
define them in parentheses instead of square brackets. The generator expression in 


snippet [2] squares and returns only the odd values in numbers: 


In [1]: numbers = [10, 3, 7, 1, 9, 4, 2, 8, 5, 6]

In [2]: for value in (x ** 2 for x in numbers if x % 2 != 0):
   ...:     print(value, end='  ')
   ...:
9  49  1  81  25


To show that a generator expression does not create a list, let’s assign the preceding 


snippet’s generator expression to a variable and evaluate the variable: 

In [3]: squares_of_odds = (x ** 2 for x in numbers if x % 2 != 0)

In [4]: squares_of_odds
Out[4]: <generator object <genexpr> at 0x1085e84c0>





The text "generator object <genexpr>" indicates that squares_of_odds is a
generator object that was created from a generator expression (genexpr).
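
Lazy evaluation can be observed with a rough sketch like the following. It assumes the standard library function sys.getsizeof, which reports only the size of the container object itself (not of the integers it refers to), and the built-in function next, which requests one value at a time from the generator:

```python
import sys

squares_list = [x ** 2 for x in range(100_000)]   # built immediately
squares_gen = (x ** 2 for x in range(100_000))    # nothing computed yet

# The generator object stays small no matter how many values it can produce.
generator_is_smaller = sys.getsizeof(squares_gen) < sys.getsizeof(squares_list)
print(generator_is_smaller)  # True

first = next(squares_gen)    # computes only the first value
second = next(squares_gen)   # computes only the second value
print(first, second)         # 0 1
```

A generator can be iterated only once; after its values are consumed, you must create a new generator to process them again.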


5.14 FILTER, MAP AND REDUCE 


The preceding section introduced several functional-style features—list 





comprehensions, filtering and mapping. Here we demonstrate the built-in filter and 
map functions for filtering and mapping, respectively. We continue discussing 
reductions in which you process a collection of elements into a single value, such as 


their count, total, product, average, minimum or maximum. 


Filtering a Sequence’s Values with the Built-In filter Function


Let’s use built-in function filter to obtain the odd values in numbers: 


In [1]: numbers = [10, 3, 7, 1, 9, 4, 2, 8, 5, 6]

In [2]: def is_odd(x):
   ...:     """Returns True only if x is odd."""
   ...:     return x % 2 != 0
   ...:

In [3]: list(filter(is_odd, numbers))
Out[3]: [3, 7, 1, 9, 5]





Like data, Python functions are objects that you can assign to variables, pass to other 
functions and return from functions. Functions that receive other functions as 
arguments are a functional-style capability called higher-order functions. For 


example, filter’s first argument must be a function that receives one argument and 





returns True if the value should be included in the result. The function is_odd returns 
True if its argument is odd. The filter function calls is_odd once for each value in 
its second argument’s iterable (numbers). Higher-order functions may also return a 


function as a result. 
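
As a sketch of a function that returns another function, consider the hypothetical make_multiple_checker below (not from the text). The function it returns can be passed to filter just like is_odd:

```python
def make_multiple_checker(divisor):
    """Hypothetical helper: returns a function that tests divisibility by divisor."""
    def is_multiple(x):
        return x % divisor == 0
    return is_multiple

numbers = [10, 3, 7, 1, 9, 4, 2, 8, 5, 6]
is_multiple_of_3 = make_multiple_checker(3)

multiples_of_3 = list(filter(is_multiple_of_3, numbers))
print(multiples_of_3)  # [3, 9, 6]
```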


Function filter returns an iterator, so filter’s results are not produced until you
iterate through them. This is another example of lazy evaluation. In snippet [3],
function list iterates through the results and creates a list containing them. We can
obtain the same results as above by using a list comprehension with an if clause:


In [4]: [item for item in numbers if is_odd(item)]
Out[4]: [3, 7, 1, 9, 5]


Using a lambda Rather than a Function


For simple functions like is_odd that return only a single expression’s value, you can 
use a lambda expression (or simply a lambda) to define the function inline where 


it’s needed—typically as it’s passed to another function: 


In [5]: list(filter(lambda x: x % 2 != 0, numbers))
Out[5]: [3, 7, 1, 9, 5]





We pass filter’s return value (an iterator) to function list here to convert the
results to a list and display them.


A lambda expression is an anonymous function—that is, a function without a name. In 


the filter call 







filter(lambda x: x % 2 != 0, numbers) 


the first argument is the lambda


lambda x: x % 2 != 0


A lambda begins with the lambda keyword followed by a comma-separated parameter 
list, a colon (:) and an expression. In this case, the parameter list has one parameter
named x. A lambda implicitly returns its expression’s value. So any simple function of 


the form 


def function_name(parameter_list):
    return expression


may be expressed as a more concise lambda of the form 


lambda parameter_list: expression


Mapping a Sequence’s Values to New Values 


Let’s use built-in function map with a lambda to square each value in numbers: 


In [6]: numbers
Out[6]: [10, 3, 7, 1, 9, 4, 2, 8, 5, 6]

In [7]: list(map(lambda x: x ** 2, numbers))
Out[7]: [100, 9, 49, 1, 81, 16, 4, 64, 25, 36]


Function map’s first argument is a function that receives one value and returns a new 
value—in this case, a lambda that squares its argument. The second argument is an 

iterable of values to map. Function map uses lazy evaluation. So, we pass to the list 
function the iterator that map returns. This enables us to iterate through and create a 


list of the mapped values. Here’s an equivalent list comprehension: 


In [8]: [item ** 2 for item in numbers]
Out[8]: [100, 9, 49, 1, 81, 16, 4, 64, 25, 36]


Combining filter and map 


You can combine the preceding filter and map operations as follows: 


In [9]: list(map(lambda x: x ** 2,
   ...:          filter(lambda x: x % 2 != 0, numbers)))
   ...:
Out[9]: [9, 49, 1, 81, 25]





There is a lot going on in snippet [9], so let’s take a closer look at it. First, filter 
returns an iterable representing only the odd values of numbers. Then map returns an 
iterable representing the squares of the filtered values. Finally, list uses map’s iterable
to create the list. You might prefer the following list comprehension to the preceding 


snippet: 


In [10]: [x ** 2 for x in numbers if x % 2 != 0]
Out[10]: [9, 49, 1, 81, 25]


For each value of x in numbers, the expression x ** 2 is performed only if the
condition x % 2 != 0 is True.


Reduction: Totaling the Elements of a Sequence with sum 


As you know, reductions process a sequence’s elements into a single value. You’ve
performed reductions with the built-in functions len, sum, min and max. You also can
create custom reductions using the functools module’s reduce function. See
https://docs.python.org/3/library/functools.html for a code example.
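
As a brief sketch of a custom reduction, the following uses functools.reduce with two-parameter lambdas to total and multiply the values in numbers. The optional third argument (1 in the product example) is the initial value:

```python
from functools import reduce

numbers = [10, 3, 7, 1, 9, 4, 2, 8, 5, 6]

# reduce combines the elements pairwise, left to right, into one value.
total = reduce(lambda x, y: x + y, numbers)
product = reduce(lambda x, y: x * y, numbers, 1)

print(total)                  # 55
print(total == sum(numbers))  # True
print(product)                # 3628800
```

For totals, the built-in sum is simpler; reduce is mainly useful when no built-in reduction fits.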

When we investigate big data and Hadoop in Chapter 16, we'll demonstrate MapReduce
programming, which is based on the filter, map and reduce operations in
functional-style programming.


5.15 OTHER SEQUENCE PROCESSING FUNCTIONS 


Python provides other built-in functions for manipulating sequences. 


Finding the Minimum and Maximum Values Using a Key Function 


We've previously shown the built-in reduction functions min and max using arguments, 
such as ints or lists of ints. Sometimes you'll need to find the minimum and 
maximum of more complex objects, such as strings. Consider the following 


comparison: 


In [1]: 'Red' < 'orange'
Out[1]: True


The letter 'R' “comes after” 'o' in the alphabet, so you might expect 'Red' to be greater
than 'orange' and the condition above to be False. However, strings are compared
by their characters’ underlying numerical values, and lowercase letters have higher
numerical values than uppercase letters. You can confirm this with built-in function
ord, which returns the numerical value of a character:


In [2]: ord('R')
Out[2]: 82

In [3]: ord('o')
Out[3]: 111


Consider the list colors, which contains strings with uppercase and lowercase letters: 


In [4]: colors = ['Red', 'orange', 'Yellow', 'green', 'Blue']


Let’s assume that we’d like to determine the minimum and maximum strings using
alphabetical order, not numerical (lexicographical) order. If we arrange colors
alphabetically


'Blue', 'green', 'orange', 'Red', 'Yellow'


you can see that 'Blue' is the minimum (that is, closest to the beginning of the 


alphabet), and 'Yellow' is the maximum (that is, closest to the end of the alphabet). 


Since Python compares strings using numerical values, you must first convert each 
string to all lowercase or all uppercase letters. Then their numerical values will also 
represent alphabetical ordering. The following snippets enable min and max to 


determine the minimum and maximum strings alphabetically: 


In [5]: min(colors, key=lambda s: s.lower())
Out[5]: 'Blue'

In [6]: max(colors, key=lambda s: s.lower())
Out[6]: 'Yellow'


The key keyword argument must be a one-parameter function that returns a value. In 
this case, it’s a lambda that calls string method lower to get a string’s lowercase 
version. Functions min and max call the key argument’s function for each element and 


use the results to compare the elements. 
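The effect of the key function is easiest to see by comparing calls with and without it. The sketch below passes str.lower directly instead of a lambda, which is an equivalent alternative rather than the form used in the snippets above.

```python
colors = ['Red', 'orange', 'Yellow', 'green', 'Blue']

# str.lower is itself a one-parameter function, so it can serve as the key
print(min(colors, key=str.lower))  # Blue
print(max(colors, key=str.lower))  # Yellow

# Without a key, uppercase letters compare as less than lowercase letters
print(min(colors))  # Blue
print(max(colors))  # orange

# The same key argument works with sorted
print(sorted(colors, key=str.lower))  # ['Blue', 'green', 'orange', 'Red', 'Yellow']
```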


Iterating Backward Through a Sequence 


Built-in function reversed returns an iterator that enables you to iterate over a 
sequence’s values backward. The following list comprehension creates a new list 


containing the squares of numbers’ values in reverse order: 


In [7]: numbers = [10, 3, 7, 1, 9, 4, 2, 8, 5, 6]

In [8]: reversed_numbers = [item ** 2 for item in reversed(numbers)]
   ...: reversed_numbers
Out[8]: [36, 25, 64, 4, 16, 81, 1, 49, 9, 100]





Combining Iterables into Tuples of Corresponding Elements 


Built-in function zip enables you to iterate over multiple iterables of data at the same 
time. The function receives as arguments any number of iterables and returns an 
iterator that produces tuples containing the elements at the same index in each. For 
example, snippet [11]’s call to zip produces the tuples ('Bob', 3.5), ('Sue', 


4.0) and ('Amanda', 3.75) consisting of the elements at index 0, 1 and 2 of each 


list, respectively: 


In [9]: names = ['Bob', 'Sue', 'Amanda']

In [10]: grade_point_averages = [3.5, 4.0, 3.75]

In [11]: for name, gpa in zip(names, grade_point_averages):
    ...:     print(f'Name={name}; GPA={gpa}')
    ...:
Name=Bob; GPA=3.5
Name=Sue; GPA=4.0
Name=Amanda; GPA=3.75


We unpack each tuple into name and gpa and display them. Function zip’s shortest 


argument determines the number of tuples produced. Here both have the same length. 
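To see the shortest-argument rule in action, the sketch below deliberately shortens one list. The use of itertools.zip_longest as a contrast is an addition for illustration and is not part of the section’s example.

```python
from itertools import zip_longest

names = ['Bob', 'Sue', 'Amanda']
grade_point_averages = [3.5, 4.0]  # deliberately one value short

# zip stops as soon as the shortest iterable is exhausted
pairs = list(zip(names, grade_point_averages))
print(pairs)  # [('Bob', 3.5), ('Sue', 4.0)]

# zip_longest keeps going and pads the missing values with fillvalue
padded = list(zip_longest(names, grade_point_averages, fillvalue=None))
print(padded)  # [('Bob', 3.5), ('Sue', 4.0), ('Amanda', None)]
```

Because zip silently drops the unmatched 'Amanda', checking that the inputs have equal lengths can prevent hard-to-spot data loss.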


5.16 TWO-DIMENSIONAL LISTS 


Lists can contain other lists as elements. A typical use of such nested (or 
multidimensional) lists is to represent tables of values consisting of information 
arranged in rows and columns. To identify a particular table element, we specify two 
indices—by convention, the first identifies the element’s row, the second the element’s 


column. 


Lists that require two indices to identify an element are called two-dimensional lists 
(or double-indexed lists or double-subscripted lists). Multidimensional lists can 


have more than two indices. Here, we introduce two-dimensional lists. 


Creating a Two-Dimensional List 


Consider a two-dimensional list with three rows and four columns (i.e., a 3-by-4 list) 


that might represent the grades of three students who each took four exams in a course: 


In [1]: a = [[77, 68, 86, 73], [96, 87, 89, 81], [70, 90, 86, 81]] 


Writing the list as follows makes its row and column tabular structure clearer: 


a = [[77, 68, 86, 73],  # first student's grades
     [96, 87, 89, 81],  # second student's grades
     [70, 90, 86, 81]]  # third student's grades


Illustrating a Two-Dimensional List 


The diagram below shows the list a, with its rows and columns of exam grade values: 


         Column 0   Column 1   Column 2   Column 3
Row 0       77         68         86         73
Row 1       96         87         89         81
Row 2       70         90         86         81





Identifying the Elements in a Two-Dimensional List 


The following diagram shows the names of list a’s elements: 


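         Column 0   Column 1   Column 2   Column 3
Row 0    a[0][0]    a[0][1]    a[0][2]    a[0][3]
Row 1    a[1][0]    a[1][1]    a[1][2]    a[1][3]
Row 2    a[2][0]    a[2][1]    a[2][2]    a[2][3]

In a name such as a[1][2], a is the list name, 1 is the row index and 2 is the column index.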






Every element is identified by a name of the form a[i][j], where a is the list’s name, and i
and j are the indices that uniquely identify each element’s row and column,
respectively. The element names in row 0 all have 0 as the first index. The element
names in column 3 all have 3 as the second index.


In the two-dimensional list a: 


• 77, 68, 86 and 73 initialize a[0][0], a[0][1], a[0][2] and a[0][3],
respectively,


• 96, 87, 89 and 81 initialize a[1][0], a[1][1], a[1][2] and a[1][3],
respectively, and


• 70, 90, 86 and 81 initialize a[2][0], a[2][1], a[2][2] and a[2][3],
respectively.


A list with m rows and n columns is called an m-by-n list and has m x n elements. 





The following nested for statement outputs the rows of the preceding two-dimensional 


list one row at a time: 


In [2]: for row in a:
   ...:     for item in row:
   ...:         print(item, end=' ')
   ...:     print()
   ...:
77 68 86 73
96 87 89 81
70 90 86 81


How the Nested Loops Execute 


Let’s modify the nested loop to display the list’s name and the row and column indices 


and value of each element: 


In [3]: for i, row in enumerate(a):
   ...:     for j, item in enumerate(row):
   ...:         print(f'a[{i}][{j}]={item}', end=' ')
   ...:     print()
   ...:
a[0][0]=77 a[0][1]=68 a[0][2]=86 a[0][3]=73
a[1][0]=96 a[1][1]=87 a[1][2]=89 a[1][3]=81
a[2][0]=70 a[2][1]=90 a[2][2]=86 a[2][3]=81





The outer for statement iterates over the two-dimensional list’s rows one row at a
time. During each iteration of the outer for statement, the inner for statement
iterates over each column in the current row. So in the first iteration of the outer loop,
row 0 is
[77, 68, 86, 73]


and the nested loop iterates through this list’s four elements a[0][0]=77,
a[0][1]=68, a[0][2]=86 and a[0][3]=73.


In the second iteration of the outer loop, row 1 is 
[96, 87, 89, 81] 


and the nested loop iterates through this list’s four elements a[1][0]=96,
a[1][1]=87, a[1][2]=89 and a[1][3]=81.


In the third iteration of the outer loop, row 2 is 
[70, 90, 86, 81] 


and the nested loop iterates through this list’s four elements a[2][0]=70,
a[2][1]=90, a[2][2]=86 and a[2][3]=81.
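The same row-by-row traversal can drive simple calculations on the grades. The sketch below is illustrative only: the names row_averages and column_totals are hypothetical, and the use of zip(*a) to walk the columns is a common idiom rather than something this section introduces.

```python
a = [[77, 68, 86, 73],   # first student's grades
     [96, 87, 89, 81],   # second student's grades
     [70, 90, 86, 81]]   # third student's grades

# Row-wise: one average per student
row_averages = [sum(row) / len(row) for row in a]
for i, average in enumerate(row_averages):
    print(f'student {i}: average {average:.2f}')

# Column-wise: zip(*a) yields one tuple per column (that is, per exam)
column_totals = [sum(column) for column in zip(*a)]
print(column_totals)  # [243, 245, 261, 235]
```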


In the “Array-Oriented Programming with NumPy” chapter, we'll cover the NumPy
library’s ndarray collection and the Pandas library’s DataFrame collection. These 
enable you to manipulate multidimensional collections more concisely and 


conveniently than the two-dimensional list manipulations you’ve seen in this section. 


5.17 INTRO TO DATA SCIENCE: SIMULATION AND 
STATIC VISUALIZATIONS 


The last few chapters’ Intro to Data Science sections discussed basic descriptive 
statistics. Here, we focus on visualizations, which help you “get to know” your data. 
Visualizations give you a powerful way to understand data that goes beyond simply 


looking at raw data. 


We use two open-source visualization libraries—Seaborn and Matplotlib—to display 
static bar charts showing the final results of a six-sided-die-rolling simulation. The 
Seaborn visualization library is built over the Matplotlib visualization library 
and simplifies many Matplotlib operations. We'll use aspects of both libraries, because 
some of the Seaborn operations return objects from the Matplotlib library. In the next 
chapter’s Intro to Data Science section, we'll make things “come alive” with dynamic 


visualizations. 


5.17.1 Sample Graphs for 600, 60,000 and 6,000,000 Die Rolls 


The screen capture below shows a vertical bar chart that, for 600 die rolls, summarizes
the frequencies with which each of the six faces appears, and their percentages of the
total. Seaborn refers to this type of graph as a bar plot:


[Figure: bar plot titled “Rolling a Six-Sided Die 600 Times”]


Here we expect about 100 occurrences of each die face. However, with such a small
number of rolls, none of the frequencies is exactly 100 (though several are close) and
most of the percentages are not close to 16.667% (about 1/6th). As we run the
simulation for 60,000 die rolls, the bars will become much closer in size. At 6,000,000
die rolls, they'll appear to be exactly the same size. This is the “law of large numbers” at


work. The next chapter will show the lengths of the bars changing dynamically. 
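The convergence described above can be checked numerically without any plotting. This is a rough sketch under stated assumptions: the helper name max_deviation and the seed value are hypothetical, and the largest run uses 600,000 rolls rather than 6,000,000 only to keep it quick.

```python
import random
from collections import Counter

random.seed(2024)  # arbitrary seed, used only to make runs repeatable


def max_deviation(number_of_rolls):
    """Return the largest gap between any face's share of the rolls and 1/6."""
    counts = Counter(random.randrange(1, 7) for _ in range(number_of_rolls))
    return max(abs(counts[face] / number_of_rolls - 1 / 6)
               for face in range(1, 7))


# The gap typically shrinks as the number of rolls grows
for n in (600, 60_000, 600_000):
    print(f'{n:>9,} rolls: max deviation {max_deviation(n):.4%}')
```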
We'll discuss how to control the plot’s appearance and contents, including: 

• the graph title inside the window (Rolling a Six-Sided Die 600 Times),

• the descriptive labels Die Value for the x-axis and Frequency for the y-axis,

• the text displayed above each bar, representing the frequency and percentage of the
total rolls, and

• the bar colors.


We'll use various Seaborn default options. For example, Seaborn determines the text 
labels along the x-axis from the die face values 1—6 and the text labels along the y-axis 
from the actual die frequencies. Behind the scenes, Matplotlib determines the positions 
and sizes of the bars, based on the window size and the magnitudes of the values the 
bars represent. It also positions the Frequency axis’s numeric labels based on the 
actual die frequencies that the bars represent. There are many more features you can 


customize. You should tweak these attributes to your personal preferences. 


The first screen capture below shows the results for 60,000 die rolls—imagine trying to 
do this by hand. In this case, we expect about 10,000 of each face. The second screen 
capture below shows the results for 6,000,000 rolls—surely something you’d never do 
by hand! In this case, we expect about 1,000,000 of each face, and the frequency bars 
appear to be identical in length (they’re close but not exactly the same length). Note 
that with more die rolls, the frequency percentages are much closer to the expected 
16.667%.

[Figures: bar plots for 60,000 and 6,000,000 die rolls]


5.17.2 Visualizing Die-Roll Frequencies and Percentages


In this section, you'll interactively develop the bar plots shown in the preceding section. 


Launching IPython for Interactive Matplotlib Development 


IPython has built-in support for interactively developing Matplotlib graphs, which you 
also need to develop Seaborn graphs. Simply launch IPython with the command: 


ipython --matplotlib 


Importing the Libraries 


First, let’s import the libraries we'll use: 


In [1]: import matplotlib.pyplot as plt

In [2]: import numpy as np

In [3]: import random

In [4]: import seaborn as sns











1. The matplotlib.pyplot module contains the Matplotlib library’s graphing
capabilities that we use. This module typically is imported with the name plt.


2. The NumPy (Numerical Python) library includes the function unique that we'll use
to summarize the die rolls. The numpy module typically is imported as np.


3. The random module contains Python’s random-number-generation functions.


4. The seaborn module contains the Seaborn library’s graphing capabilities we use.
This module typically is imported with the name sns. Search for why this curious
abbreviation was chosen.


Rolling the Die and Calculating Die Frequencies 


Next, let’s use a list comprehension to create a list of 600 random die values, then use 
NumPy’s unique function to determine the unique roll values (most likely all six 


possible face values) and their frequencies: 


In [5]: rolls = [random.randrange(1, 7) for i in range(600)]

In [6]: values, frequencies = np.unique(rolls, return_counts=True)





The NumPy library provides the high-performance ndarray collection, which is
typically much faster than lists.* Though we do not use ndarray directly here, the
NumPy unique function expects an ndarray argument and returns an ndarray. If
you pass a list (like rolls), NumPy converts it to an ndarray for better performance.
We'll simply assign the ndarray that unique returns to a variable for use by a
Seaborn plotting function.


* We'll run a performance comparison in Chapter 7, where we discuss ndarray in
depth.


Specifying the keyword argument return_counts=True tells unique to count each
unique value’s number of occurrences. In this case, unique returns a tuple of two one-
dimensional ndarrays containing the sorted unique values and the corresponding
frequencies, respectively. We unpack the tuple’s ndarrays into the variables values
and frequencies. If return_counts is False, only the array of unique values is
returned.
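To make the meaning of “sorted unique values and corresponding frequencies” concrete without installing NumPy, the sketch below builds the same two sequences with the standard library’s collections.Counter. This is a swapped-in analogue for illustration, not how np.unique is implemented, and it produces plain lists rather than ndarrays. The small fixed sample replaces the random rolls so the result is predictable.

```python
from collections import Counter

rolls = [3, 1, 6, 3, 2, 6, 6, 1]  # small fixed sample instead of random rolls

counts = Counter(rolls)                             # maps each value to its count
values = sorted(counts)                             # sorted unique values
frequencies = [counts[value] for value in values]   # counts in the same order

print(values)       # [1, 2, 3, 6]
print(frequencies)  # [2, 1, 2, 3]
```

Note that a face that never occurs (here 4 and 5) is simply absent from both sequences, which is why the text says the unique values are “most likely” all six faces.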


Creating the Initial Bar Plot 


Let’s create the bar plot’s title, set its style, then graph the die faces and frequencies: 


In [7]: title = f'Rolling a Six-Sided Die {len(rolls):,} Times'

In [8]: sns.set_style('whitegrid')

In [9]: axes = sns.barplot(x=values, y=frequencies, palette='bright')





Snippet [7]’s f-string includes the number of die rolls in the bar plot’s title. The comma
(,) format specifier in


{len(rolls):,}


displays the number with thousands separators, so 60000 would be displayed as
60,000.
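The format specifiers used in this example can be tried on their own. In the sketch below, the variable names and the sample frequency of 101 are illustrative values, not results from the simulation.

```python
number_of_rolls = 60000
title = f'Rolling a Six-Sided Die {number_of_rolls:,} Times'
print(title)  # Rolling a Six-Sided Die 60,000 Times

# .3% multiplies by 100, keeps three digits after the decimal point and appends %
frequency = 101  # hypothetical count for one face out of 600 rolls
percentage = f'{frequency / 600:.3%}'
print(percentage)  # 16.833%
```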


By default, Seaborn plots graphs on a plain white background, but it provides several 


styles to choose from ('darkgrid', 'whitegrid', 'dark', 'white' and 





'ticks'). Snippet [8] specifies the 'whitegrid' style, which displays light-gray 
horizontal lines in the vertical bar plot. These help you see more easily how each bar’s 


height corresponds to the numeric frequency labels at the bar plot’s left side. 


Snippet [9] graphs the die frequencies using Seaborn’s barplot function. When you
execute this snippet, the following window appears (because you launched IPython with
the --matplotlib option):

[Figure: window showing the initial bar plot]

Seaborn interacts with Matplotlib to display the bars by creating a Matplotlib Axes
object, which manages the content that appears in the window. Behind the scenes, 
Seaborn uses a Matplotlib Figure object to manage the window in which the Axes will 
appear. Function barplot’s first two arguments are ndarrays containing the x-axis 
and y-axis values, respectively. We used the optional palette keyword argument to 
choose Seaborn’s predefined color palette 'bright'. You can view the palette options 


at: 
https://seaborn.pydata.org/tutorial/color_palettes.html


Function barplot returns the Axes object that it configured. We assign this to the 
variable axes so we can use it to configure other aspects of our final plot. Any changes 
you make to the bar plot after this point will appear immediately when you execute the 


corresponding snippet. 


Setting the Window Title and Labeling the x- and y-Axes 


The next two snippets add some descriptive text to the bar plot: 


In [10]: axes.set_title(title)
Out[10]: Text(0.5, 1, 'Rolling a Six-Sided Die 600 Times')

In [11]: axes.set(xlabel='Die Value', ylabel='Frequency')
Out[11]: [Text(92.6667, 0.5, 'Frequency'), Text(0.5, 58.7667, 'Die Value')]





Snippet [10] uses the axes object’s set_title method to display the title string 
centered above the plot. This method returns a Text object containing the title and its 
location in the window, which IPython simply displays as output for confirmation. You 


can ignore the Out []s in the snippets above. 


Snippet [11] adds labels to each axis. The set method receives keyword arguments for
the Axes object’s properties to set. The method displays the xlabel text along the x-
axis, and the ylabel text along the y-axis, and returns a list of Text objects
containing the labels and their locations. The bar plot now appears as follows:


[Figure: bar plot titled “Rolling a Six-Sided Die 600 Times” with the Die Value and Frequency axis labels]


Finalizing the Bar Plot


The next two snippets complete the graph by making room for the text above each bar, 


then displaying it: 


In [12]: axes.set_ylim(top=max(frequencies) * 1.10)
Out[12]: (0.0, 122.10000000000001)

In [13]: for bar, frequency in zip(axes.patches, frequencies):
    ...:     text_x = bar.get_x() + bar.get_width() / 2.0
    ...:     text_y = bar.get_height()
    ...:     text = f'{frequency:,}\n{frequency / len(rolls):.3%}'
    ...:     axes.text(text_x, text_y, text,
    ...:               fontsize=11, ha='center', va='bottom')
    ...:


To make room for the text above the bars, snippet [12] scales the y-axis by 10%. We
chose this value via experimentation. The Axes object’s set_ylim method has many
optional keyword arguments. Here, we use only top to change the maximum value
represented by the y-axis. We multiplied the largest frequency by 1.10 to ensure that the
y-axis is 10% taller than the tallest bar.


Finally, snippet [13] displays each bar’s frequency value and percentage of the total 


rolls. The axes object’s patches collection contains two-dimensional colored shapes 





that represent the plot’s bars. The for statement uses zip to iterate through the 





patches and their corresponding frequency values. Each iteration unpacks into bar 








and frequency one of the tuples zip returns. The for statement’s suite operates as 


follows: 


• The first statement calculates the center x-coordinate where the text will appear. We
calculate this as the sum of the bar’s left-edge x-coordinate (bar.get_x()) and
half of the bar’s width (bar.get_width() / 2.0).


• The second statement gets the y-coordinate where the text will appear:
bar.get_height() returns the bar’s height, which is also the y-coordinate of the bar’s
top because the bars start at 0 on the y-axis.


• The third statement creates a two-line string containing that bar’s frequency and the
corresponding percentage of the total die rolls.


• The last statement calls the Axes object’s text method to display the text above the
bar. This method’s first two arguments specify the text’s x-y position, and the third
argument is the text to display. The keyword argument ha specifies the horizontal
alignment; we centered the text horizontally around the x-coordinate. The keyword
argument va specifies the vertical alignment; we aligned the bottom of the text
with the y-coordinate. The final bar plot is shown below:

[Figure: final bar plot with the frequency and percentage displayed above each bar]


Rolling Again and Updating the Bar Plot: Introducing IPython Magics


Now that you've created a nice bar plot, you probably want to try a different number of 
die rolls. First, clear the existing graph by calling Matplotlib’s cla (clear axes) function: 


In [14]: plt.cla()


IPython provides special commands called magics for conveniently performing
various tasks. Let’s use the %recall magic to get snippet [5], which created the
rolls list, and place the code at the next In [] prompt:


In [15]: %recall 5

In [16]: rolls = [random.randrange(1, 7) for i in range(600)]


You can now edit the snippet to change the number of rolls to 60000, then press Enter 


to create a new list: 


In [16]: rolls = [random.randrange(1, 7) for i in range(60000)]


Next, recall snippets [6] through [13]. This displays all the snippets in the specified
range in the next In [] prompt. Press Enter to re-execute these snippets:


In [17]: %recall 6-13

In [18]: values, frequencies = np.unique(rolls, return_counts=True)
    ...: title = f'Rolling a Six-Sided Die {len(rolls):,} Times'
    ...: sns.set_style('whitegrid')
    ...: axes = sns.barplot(x=values, y=frequencies, palette='bright')
    ...: axes.set_title(title)
    ...: axes.set(xlabel='Die Value', ylabel='Frequency')
    ...: axes.set_ylim(top=max(frequencies) * 1.10)
    ...: for bar, frequency in zip(axes.patches, frequencies):
    ...:     text_x = bar.get_x() + bar.get_width() / 2.0
    ...:     text_y = bar.get_height()
    ...:     text = f'{frequency:,}\n{frequency / len(rolls):.3%}'
    ...:     axes.text(text_x, text_y, text,
    ...:               fontsize=11, ha='center', va='bottom')
    ...:





The updated bar plot is shown below:

[Figure: updated bar plot for 60,000 die rolls]


Saving Snippets to a File with the %save Magic


Once you've interactively created a plot, you may want to save the code to a file so you
can turn it into a script and run it in the future. Let’s use the %save magic to save
snippets 1 through 13 to a file named RollDie.py. IPython indicates the file to which
the lines were written, then displays the lines that it saved:


In [19]: %save RollDie.py 1-13
The following commands were written to file `RollDie.py`:
import matplotlib.pyplot as plt
import numpy as np
import random
import seaborn as sns
rolls = [random.randrange(1, 7) for i in range(600)]
values, frequencies = np.unique(rolls, return_counts=True)
title = f'Rolling a Six-Sided Die {len(rolls):,} Times'
sns.set_style('whitegrid')
axes = sns.barplot(x=values, y=frequencies, palette='bright')
axes.set_title(title)
axes.set(xlabel='Die Value', ylabel='Frequency')
axes.set_ylim(top=max(frequencies) * 1.10)
for bar, frequency in zip(axes.patches, frequencies):
    text_x = bar.get_x() + bar.get_width() / 2.0
    text_y = bar.get_height()
    text = f'{frequency:,}\n{frequency / len(rolls):.3%}'
    axes.text(text_x, text_y, text,
              fontsize=11, ha='center', va='bottom')


Command-Line Arguments; Displaying a Plot from a Script 


Provided with this chapter’s examples is an edited version of the RollDie.py file you
saved above. We added comments and two modifications so you can run the script
with an argument that specifies the number of die rolls, as in:


ipython RollDie.py 600 


The Python Standard Library’s sys module enables a script to receive command-line 
arguments that are passed into the program. These include the script’s name and any 
values that appear to the right of it when you execute the script. The sys module’s 
argv list contains the arguments. In the command above, argv[0] is the string
'RollDie.py' and argv[1] is the string '600'. To control the number of die rolls
with the command-line argument’s value, we modified the statement that creates the 


rolls list as follows: 


rolls = [random.randrange(1, 7) for i in range(int(sys.argv[1]))]


Note that we converted the argv[1] string to an int.
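One way to make the script tolerate a missing argument is sketched below. The helper name number_of_rolls and the default of 600 are hypothetical additions for illustration; the chapter’s script reads sys.argv[1] directly. The helper takes the argument list as a parameter so it can be tried with sample lists.

```python
def number_of_rolls(argv, default=600):
    """Return the roll count from argv[1] as an int, or default if it is absent."""
    return int(argv[1]) if len(argv) > 1 else default


# argv[0] is the script name; argv[1], when present, is the first argument
print(number_of_rolls(['RollDie.py', '60000']))  # 60000
print(number_of_rolls(['RollDie.py']))           # 600

# In a script, you would import sys and pass the real list: number_of_rolls(sys.argv)
```

A non-numeric argument would still raise a ValueError from int, which is a reasonable failure for a small script.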


Matplotlib and Seaborn do not automatically display the plot for you 
when you create it in a script. So at the end of the script we added the following 


call to Matplotlib’s show function, which displays the window containing the graph: 


plt.show() 


5.18 WRAP-UP 


This chapter presented more details of the list and tuple sequences. You created lists, 
accessed their elements and determined their length. You saw that lists are mutable, so 
you can modify their contents, including growing and shrinking the lists as your 


programs execute. You saw that accessing a nonexistent element causes an 





IndexError. You used for statements to iterate through list elements. 


We discussed tuples, which like lists are sequences, but are immutable. You unpacked a 
tuple’s elements into separate variables. You used enumerate to create an iterable of 


tuples, each with a list index and corresponding element value. 


You learned that all sequences support slicing, which creates new sequences with 
subsets of the original elements. You used the del statement to remove elements from 
lists and delete variables from interactive sessions. We passed lists, list elements and 
slices of lists to functions. You saw how to search and sort lists, and how to search 
tuples. We used list methods to insert, append and remove elements, and to reverse a 


list’s elements and copy lists. 


We showed how to simulate stacks with lists. We used the concise list-comprehension 
notation to create new lists. We used additional built-in functions to sum list elements,
iterate backward through a list, find the minimum and maximum values, filter values 
and map values to new values. We showed how nested lists can represent two- 


dimensional tables in which data is arranged in rows and columns. You saw how nested 





for loops process two-dimensional lists. 


The chapter concluded with an Intro to Data Science section that presented a die-rolling
simulation and static visualizations. A detailed code example used the Seaborn
and Matplotlib visualization libraries to create a static bar plot visualization of the 
simulation’s final results. In the next Intro to Data Science section, we use a die-rolling 


simulation with a dynamic bar plot visualization to make the plot “come alive.” 


In the next chapter, “Dictionaries and Sets,” we'll continue our discussion of Python’s 
built-in collections. We'll use dictionaries to store unordered collections of key—value 
pairs that map immutable keys to values, just as a conventional dictionary maps words 


to definitions. We'll use sets to store unordered collections of unique elements. 


In the “Array-Oriented Programming with NumPy” chapter, we'll discuss NumPy’s 
ndarray collection in more detail. You'll see that while lists are fine for small amounts 
of data, they are not efficient for the large amounts of data you'll encounter in big data 
analytics applications. For such cases, the NumPy library’s highly optimized ndarray 
collection should be used. ndarray (n-dimensional array) can be much faster than 
lists. We’ll run Python profiling tests to see just how much faster. As you'll see, NumPy 
also includes many capabilities for conveniently and efficiently manipulating arrays of 
many dimensions. In big data analytics applications, the processing demands can be 
humongous, so everything we can do to improve performance significantly matters. In 


our “Big Data: Hadoop, Spark, NoSQL and IoT” chapter, you'll use one of the most
popular high-performance big-data databases, MongoDB.*


* The database’s name is rooted in the word humongous.


6. Dictionaries and Sets 


Objectives 

In this chapter, you'll: 

• Use dictionaries to represent unordered collections of key-value pairs.

• Use sets to represent unordered collections of unique values.

• Create, initialize and refer to elements of dictionaries and sets.

• Iterate through a dictionary’s keys, values and key-value pairs.

• Add, remove and update a dictionary’s key-value pairs.

• Use dictionary and set comparison operators.

• Combine sets with set operators and methods.

• Use operators in and not in to determine if a dictionary contains a key or a set
contains a value.

• Use the mutable set operations to modify a set’s contents.

• Use comprehensions to create dictionaries and sets quickly and conveniently.

• Learn how to build dynamic visualizations.

• Enhance your understanding of mutability and immutability.




6.1 INTRODUCTION 


We've discussed three built-in sequence collections—strings, lists and tuples. Now, we 
consider the built-in non-sequence collections—dictionaries and sets. A dictionary is 
an unordered collection which stores key—value pairs that map immutable keys to 


values, just as a conventional dictionary maps words to definitions. A set is an 


unordered collection of unique immutable elements. 


6.2 DICTIONARIES 


A dictionary associates keys with values. Each key maps to a specific value. The 
following table contains examples of dictionaries with their keys, key types, values and 


value types: 


Keys                  Key type   Values                   Value type

Country names         str        Internet country codes   str
Decimal numbers       int        Roman numerals           str
States                str        Agricultural products    list of str
Hospital patients     str        Vital signs              tuple of ints and floats
Baseball players      str        Batting averages         float
Metric measurements   str        Abbreviations            str
Inventory codes       str        Quantity in stock        int


Unique Keys


A dictionary’s keys must be immutable (such as strings, numbers or tuples) and unique 
(that is, no duplicates). Multiple keys can have the same value, such as two different 


inventory codes that have the same quantity in stock. 


6.2.1 Creating a Dictionary 


You can create a dictionary by enclosing in curly braces, { }, a comma-separated list of 
key—value pairs, each of the form key: value. You can create an empty dictionary with 
{}. 


Let’s create a dictionary with the country-name keys 'Finland', 'South Africa'
and 'Nepal' and their corresponding Internet country code values 'fi', 'za' and
'np':


In [1]: country_codes = {'Finland': 'fi', 'South Africa': 'za',
   ...:                  'Nepal': 'np'}
   ...:

In [2]: country_codes
Out[2]: {'Finland': 'fi', 'South Africa': 'za', 'Nepal': 'np'}


When you output a dictionary, its comma-separated list of key-value pairs is always
enclosed in curly braces. In snippet [2]’s output the key-value pairs are displayed in the
order they were inserted. Python 3.7 and later guarantee that a dictionary preserves
insertion order, but earlier versions do not, so code that must also run on older
interpreters should not depend on the order of the key-value pairs.


Determining if a Dictionary Is Empty 


The built-in function len returns the number of key-value pairs in a dictionary:


In [3]: len(country_codes)
Out[3]: 3


You can use a dictionary as a condition to determine if it’s empty: a non-empty
dictionary evaluates to True:


In [4]: if country_codes:
   ...:     print('country_codes is not empty')
   ...: else:
   ...:     print('country_codes is empty')
   ...:
country_codes is not empty


An empty dictionary evaluates to False. To demonstrate this, in the following code we 
call method clear to delete the dictionary’s key—value pairs, then in snippet [6] we 


recall and re-execute snippet [4]: 


In [5]: country_codes.clear()

In [6]: if country_codes:
   ...:     print('country_codes is not empty')
   ...: else:
   ...:     print('country_codes is empty')
   ...:
country_codes is empty


6.2.2 Iterating through a Dictionary 


The following dictionary maps month-name strings to int values representing the 
numbers of days in the corresponding month. Note that multiple keys can have the 


same value: 


In [1]: days_per_month = {'January': 31, 'February': 28, 'March': 31}

In [2]: days_per_month
Out[2]: {'January': 31, 'February': 28, 'March': 31}


Again, the dictionary’s string representation shows the key-value pairs in their
insertion order. Python 3.7 and later guarantee this ordering, but earlier versions do
not. We'll show how to process keys in sorted order later in this chapter.


The following for statement iterates through days_per_month’s key-value pairs.
Dictionary method items returns each key-value pair as a tuple, which we unpack into
month and days:


lick here to view code image 
In [32 for month, days an days per month .atems () + 
print (tt {month} has {days} days") 
January has 31 days 


February has 28 days 
March has 31 days 


6.2.3 Basic Dictionary Operations 


For this section, let's begin by creating and displaying the dictionary roman_numerals. We intentionally provide the incorrect value 100 for the key 'X', which we'll correct shortly:

In [1]: roman_numerals = {'I': 1, 'II': 2, 'III': 3, 'V': 5, 'X': 100}

In [2]: roman_numerals
Out[2]: {'I': 1, 'II': 2, 'III': 3, 'V': 5, 'X': 100}

Accessing the Value Associated with a Key 


Let's get the value associated with the key 'V':

In [3]: roman_numerals['V']
Out[3]: 5


Updating the Value of an Existing Key-Value Pair 


You can update a key's associated value in an assignment statement, which we do here to replace the incorrect value associated with the key 'X':

In [4]: roman_numerals['X'] = 10

In [5]: roman_numerals
Out[5]: {'I': 1, 'II': 2, 'III': 3, 'V': 5, 'X': 10}


Adding a New Key-Value Pair 


Assigning a value to a nonexistent key inserts the key-value pair in the dictionary:

In [6]: roman_numerals['L'] = 50

In [7]: roman_numerals
Out[7]: {'I': 1, 'II': 2, 'III': 3, 'V': 5, 'X': 10, 'L': 50}

String keys are case sensitive. Assigning to a nonexistent key inserts a new key-value pair. This may be what you intend, or it could be a logic error.
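
As a minimal sketch of the logic error described above, the following code uses an illustrative dictionary named numerals (an assumption for this example, separate from the session's roman_numerals). Assigning to the lowercase key 'x' does not update 'X'; it quietly inserts a second key:

```python
# Illustrative dictionary; the lowercase key 'x' below is a deliberate typo.
numerals = {'I': 1, 'V': 5, 'X': 100}

numerals['x'] = 10  # intended to correct 'X', but inserts a new key instead

print(numerals)       # both 'X' and 'x' are now present
print(len(numerals))  # the dictionary grew by one entry
```

Checking membership first (for example, with the in operator) is one way to catch this kind of mistake before assigning.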


Removing a Key-Value Pair 


You can delete a key-value pair from a dictionary with the del statement:

In [8]: del roman_numerals['III']

In [9]: roman_numerals
Out[9]: {'I': 1, 'II': 2, 'V': 5, 'X': 10, 'L': 50}

You also can remove a key-value pair with the dictionary method pop, which returns the value for the removed key:

In [10]: roman_numerals.pop('X')
Out[10]: 10

In [11]: roman_numerals
Out[11]: {'I': 1, 'II': 2, 'V': 5, 'L': 50}


Attempting to Access a Nonexistent Key 


Accessing a nonexistent key results in a KeyError:

In [12]: roman_numerals['III']
-------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-12-ccd50c7f0c8b> in <module>()
----> 1 roman_numerals['III']

KeyError: 'III'




















You can prevent this error by using dictionary method get, which normally returns its argument's corresponding value. If that key is not found, get returns None. IPython does not display anything when None is returned in snippet [13]. If you specify a second argument to get, it returns that value if the key is not found:

In [13]: roman_numerals.get('III')

In [14]: roman_numerals.get('III', 'III not in dictionary')
Out[14]: 'III not in dictionary'

In [15]: roman_numerals.get('V')
Out[15]: 5
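
The following sketch shows how get's second argument can supply a fallback value inside a function. The helper name lookup and the sentinel value -1 are assumptions chosen for illustration:

```python
roman_numerals = {'I': 1, 'II': 2, 'V': 5, 'L': 50}


def lookup(numeral):
    """Hypothetical helper: return the numeral's value, or -1 if absent."""
    return roman_numerals.get(numeral, -1)


print(lookup('V'))    # key is present, so its value is returned
print(lookup('III'))  # key is absent, so the default is returned
print(roman_numerals.get('III') is None)  # without a default, get returns None
```

Unlike indexing with square brackets, none of these calls can raise a KeyError.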


Testing Whether a Dictionary Contains a Specified Key 


Operators in and not in can determine whether a dictionary contains a specified key:

In [16]: 'V' in roman_numerals
Out[16]: True

In [17]: 'III' in roman_numerals
Out[17]: False

In [18]: 'III' not in roman_numerals
Out[18]: True


6.2.4 Dictionary Methods keys and values 


Earlier, we used dictionary method items to iterate through tuples of a dictionary's key-value pairs. Similarly, methods keys and values can be used to iterate through only a dictionary's keys or values, respectively:

In [1]: months = {'January': 1, 'February': 2, 'March': 3}

In [2]: for month_name in months.keys():
   ...:     print(month_name, end=' ')
   ...:
January February March

In [3]: for month_number in months.values():
   ...:     print(month_number, end=' ')
   ...:
1 2 3

Dictionary Views


Dictionary methods items, keys and values each return a view of a dictionary's data. When you iterate over a view, it "sees" the dictionary's current contents; it does not have its own copy of the data.

To show that views do not maintain their own copies of a dictionary's data, let's first save the view returned by keys into the variable months_view, then iterate through it:

In [4]: months_view = months.keys()

In [5]: for key in months_view:
   ...:     print(key, end=' ')
   ...:
January February March


Next, let's add a new key-value pair to months and display the updated dictionary:

In [6]: months['December'] = 12

In [7]: months
Out[7]: {'January': 1, 'February': 2, 'March': 3, 'December': 12}

Now, let's iterate through months_view again. The key we added above is indeed displayed:

In [8]: for key in months_view:
   ...:     print(key, end=' ')
   ...:
January February March December

Do not modify a dictionary while iterating through a view. According to Section 4.10.1 of the Python Standard Library documentation,* either you'll get a RuntimeError or the loop might not process all of the view's values.

* https://docs.python.org/3/library/stdtypes.html#dictionary-view-objects.
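
The warning above can be demonstrated with a short sketch. It assumes CPython, where inserting a key during iteration raises a RuntimeError on the next loop step; the uppercase keys are purely illustrative. Iterating over a list snapshot of the keys is one common way to avoid the problem:

```python
months = {'January': 1, 'February': 2, 'March': 3}

# Unsafe: inserting keys while iterating over the view.
error_raised = False
try:
    for name in months.keys():
        months[name.upper()] = months[name]
except RuntimeError:
    error_raised = True  # the dictionary changed size during iteration

# Safe: iterate over a list snapshot of the keys instead of the live view.
months = {'January': 1, 'February': 2, 'March': 3}
for name in list(months.keys()):
    months[name.upper()] = months[name]

print(error_raised)
print(months)
```

The snapshot costs one extra list, but the loop then visits exactly the keys that existed when it started.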


Converting Dictionary Keys, Values and Key-Value Pairs to Lists 


You might occasionally need lists of a dictionary's keys, values or key-value pairs. To obtain such a list, pass the view returned by keys, values or items to the built-in list function. Modifying these lists does not modify the corresponding dictionary:

In [9]: list(months.keys())
Out[9]: ['January', 'February', 'March', 'December']

In [10]: list(months.values())
Out[10]: [1, 2, 3, 12]

In [11]: list(months.items())
Out[11]: [('January', 1), ('February', 2), ('March', 3), ('December', 12)]


















































Processing Keys in Sorted Order

To process keys in sorted order, you can use built-in function sorted as follows:

In [12]: for month_name in sorted(months.keys()):
   ...:     print(month_name, end=' ')
   ...:
December February January March


6.2.5 Dictionary Comparisons 


The comparison operators == and != can be used to determine whether two dictionaries have identical or different contents. An equals (==) comparison evaluates to True if both dictionaries have the same key-value pairs, regardless of the order in which those key-value pairs were added to each dictionary:

In [1]: country_capitals1 = {'Belgium': 'Brussels',
   ...:                      'Haiti': 'Port-au-Prince'}

In [2]: country_capitals2 = {'Nepal': 'Kathmandu',
   ...:                      'Uruguay': 'Montevideo'}

In [3]: country_capitals3 = {'Haiti': 'Port-au-Prince',
   ...:                      'Belgium': 'Brussels'}

In [4]: country_capitals1 == country_capitals2
Out[4]: False

In [5]: country_capitals1 == country_capitals3
Out[5]: True

In [6]: country_capitals1 != country_capitals2
Out[6]: True





6.2.6 Example: Dictionary of Student Grades 


The following script represents an instructor's grade book as a dictionary that maps each student's name (a string) to a list of integers containing that student's grades on three exams. In each iteration of the loop that displays the data (lines 13-17), we unpack a key-value pair into the variables name and grades containing one student's name and the corresponding list of three grades. Line 14 uses built-in function sum to total a given student's grades, then line 15 calculates and displays that student's average by dividing total by the number of grades for that student (len(grades)). Lines 16-17 keep track of the total of all four students' grades and the number of grades for all the students, respectively. Line 19 prints the class average of all the students' grades on all the exams.


 1 # fig06_01.py
 2 """Using a dictionary to represent an instructor's grade book."""
 3 grade_book = {
 4     'Susan': [92, 85, 100],
 5     'Eduardo': [83, 95, 79],
 6     'Azizi': [91, 89, 82],
 7     'Pantipa': [97, 91, 92]
 8 }
 9
10 all_grades_total = 0
11 all_grades_count = 0
12
13 for name, grades in grade_book.items():
14     total = sum(grades)
15     print(f'Average for {name} is {total/len(grades):.2f}')
16     all_grades_total += total
17     all_grades_count += len(grades)
18
19 print(f"Class's average is: {all_grades_total / all_grades_count:.2f}")

Average for Susan is 92.33
Average for Eduardo is 85.67
Average for Azizi is 87.33
Average for Pantipa is 93.33
Class's average is: 89.67





6.2.7 Example: Word Counts*

* Techniques like word frequency counting are often used to analyze published works. For example, some people believe that the works of William Shakespeare actually might have been written by Sir Francis Bacon, Christopher Marlowe or others. Comparing the word frequencies of their works with those of Shakespeare can reveal writing-style similarities. We'll look at other document-analysis techniques in the Natural Language Processing (NLP) chapter.

The following script builds a dictionary to count the number of occurrences of each word in a string. Lines 4-5 create a string text that we'll break into words, a process known as tokenizing a string. Python automatically concatenates strings separated by whitespace in parentheses. Line 7 creates an empty dictionary. The dictionary's keys will be the unique words, and its values will be integer counts of how many times each word appears in text.


 1 # fig06_02.py
 2 """Tokenizing a string and counting unique words."""
 3
 4 text = ('this is sample text with several words '
 5         'this is more sample text with some different words')
 6
 7 word_counts = {}
 8
 9 # count occurrences of each unique word
10 for word in text.split():
11     if word in word_counts:
12         word_counts[word] += 1  # update existing key-value pair
13     else:
14         word_counts[word] = 1  # insert new key-value pair
15
16 print(f'{"WORD":<12}COUNT')
17
18 for word, count in sorted(word_counts.items()):
19     print(f'{word:<12}{count}')
20
21 print('\nNumber of unique words:', len(word_counts))

WORD        COUNT
different   1
is          2
more        1
sample      2
several     1
some        1
text        2
this        2
with        2
words       2

Number of unique words: 10





Line 10 tokenizes text by calling string method split, which separates the words using the method's delimiter string argument. If you do not provide an argument, split separates on runs of whitespace (spaces, tabs and newlines). The method returns a list of tokens (that is, the words in text). Lines 10-14 iterate through the list of words. For each word, line 11 determines whether that word (the key) is already in the dictionary. If so, line 12 increments that word's count; otherwise, line 14 inserts a new key-value pair for that word with an initial count of 1.

Lines 16-21 summarize the results in a two-column table containing each word and its corresponding count. The for statement in lines 18 and 19 iterates through the dictionary's key-value pairs. It unpacks each key and value into the variables word and count, then displays them in two columns. Line 21 displays the number of unique words.
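
As an alternative sketch of the counting loop, the if...else test can be folded into one statement with dictionary method get, whose second argument supplies 0 for a word that is not yet a key. This is a variation for illustration, not the script's own approach; the tokens variable is an added example of how split treats mixed whitespace:

```python
text = ('this is sample text with several words '
        'this is more sample text with some different words')

word_counts = {}
for word in text.split():
    # get returns 0 when word is not yet a key, so one statement
    # handles both the insert and the update case
    word_counts[word] = word_counts.get(word, 0) + 1

# With no argument, split treats any run of whitespace as one separator.
tokens = 'a  b\tc\n'.split()

print(word_counts)
print(tokens)
```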


Python Standard Library Module collections 


The Python Standard Library already contains the counting functionality that we implemented using the dictionary and the loop in lines 10-14. The module collections contains the type Counter, which receives an iterable and summarizes its elements. Let's reimplement the preceding script in fewer lines of code with Counter:

In [1]: from collections import Counter

In [2]: text = ('this is sample text with several words '
   ...:         'this is more sample text with some different words')

In [3]: counter = Counter(text.split())

In [4]: for word, count in sorted(counter.items()):
   ...:     print(f'{word:<12}{count}')
   ...:
different   1
is          2
more        1
sample      2
several     1
some        1
text        2
this        2
with        2
words       2

In [5]: print('Number of unique keys:', len(counter.keys()))
Number of unique keys: 10


Snippet [3] creates the Counter, which summarizes the list of strings returned by text.split(). In snippet [4], Counter method items returns each string and its associated count as a tuple. We use built-in function sorted to get a list of these tuples in ascending order. By default sorted orders the tuples by their first elements. If those are identical, then it looks at the second element, and so on. The for statement iterates over the resulting sorted list, displaying each word and count in two columns.
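
Sorting by word is only one possible ordering. The sketch below, an illustrative extension rather than part of the preceding session, orders the same counts by frequency instead. It uses Counter's most_common method and a sorted call with a key function; the variable names top_three and by_count are assumptions for this example:

```python
from collections import Counter

text = ('this is sample text with several words '
        'this is more sample text with some different words')
counter = Counter(text.split())

# most_common(n) returns the n (word, count) tuples with the highest counts
top_three = counter.most_common(3)

# Order by descending count, breaking ties alphabetically by word
by_count = sorted(counter.items(), key=lambda pair: (-pair[1], pair[0]))

print(top_three)
for word, count in by_count:
    print(f'{word:<12}{count}')
```

Negating the count in the key tuple sorts counts from highest to lowest while still sorting tied words in ascending order.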


6.2.8 Dictionary Method update 


You may insert and update key-value pairs using dictionary method update. First, let's create an empty country_codes dictionary:

In [1]: country_codes = {}

The following update call receives a dictionary of key-value pairs to insert or update:

In [2]: country_codes.update({'South Africa': 'za'})

In [3]: country_codes
Out[3]: {'South Africa': 'za'}

Method update can convert keyword arguments into key-value pairs to insert. The following call automatically converts the parameter name Australia into the string key 'Australia' and associates the value 'ar' with that key:

In [4]: country_codes.update(Australia='ar')

In [5]: country_codes
Out[5]: {'South Africa': 'za', 'Australia': 'ar'}

Snippet [4] provided an incorrect country code for Australia. Let's correct this by using another keyword argument to update the value associated with 'Australia':

In [6]: country_codes.update(Australia='au')

In [7]: country_codes
Out[7]: {'South Africa': 'za', 'Australia': 'au'}

Method update also can receive an iterable object containing key-value pairs, such as a list of two-element tuples.
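
A minimal sketch of that last form follows. The starting dictionary and the added country codes are illustrative values chosen for this example:

```python
country_codes = {'South Africa': 'za', 'Australia': 'ar'}

# Each two-element tuple is treated as a (key, value) pair:
# 'Australia' already exists, so its value is replaced;
# 'Nepal' and 'Finland' are new, so they are inserted.
country_codes.update([('Australia', 'au'), ('Nepal', 'np'), ('Finland', 'fi')])

print(country_codes)
```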


6.2.9 Dictionary Comprehensions 


Dictionary comprehensions provide a convenient notation for quickly generating dictionaries, often by mapping one dictionary to another. For example, in a dictionary with unique values, you can reverse the key-value pairs:

In [1]: months = {'January': 1, 'February': 2, 'March': 3}

In [2]: months2 = {number: name for name, number in months.items()}

In [3]: months2
Out[3]: {1: 'January', 2: 'February', 3: 'March'}





Curly braces delimit a dictionary comprehension, and the expression to the left of the for clause specifies a key-value pair of the form key: value. The comprehension iterates through months.items(), unpacking each key-value pair tuple into the variables name and number. The expression number: name reverses the key and value, so the new dictionary maps the month numbers to the month names.

What if months contained duplicate values? As these become the keys in months2, attempting to insert a duplicate key simply updates the existing key's value. So if 'February' and 'March' both mapped to 2 originally, the preceding code would have produced

{1: 'January', 2: 'March'}
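
The sketch below reproduces that duplicate-value case and shows one possible way to avoid losing entries: grouping the names for each number into a list. The grouped dictionary and its use of dictionary method setdefault are an illustrative assumption, not a technique this section introduces:

```python
months = {'January': 1, 'February': 2, 'March': 2}  # duplicate value 2

# Reversing with a comprehension: the later pair overwrites the earlier one.
months2 = {number: name for name, number in months.items()}

# One alternative: collect every name that shares a number.
grouped = {}
for name, number in months.items():
    # setdefault returns the existing list, or inserts and returns a new one
    grouped.setdefault(number, []).append(name)

print(months2)
print(grouped)
```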


A dictionary comprehension also can map a dictionary's values to new values. The following comprehension converts a dictionary of names and lists of grades into a dictionary of names and grade-point averages. The variables k and v commonly mean key and value:

In [4]: grades = {'Sue': [98, 87, 94], 'Bob': [84, 95, 91]}

In [5]: grades2 = {k: sum(v) / len(v) for k, v in grades.items()}

In [6]: grades2
Out[6]: {'Sue': 93.0, 'Bob': 90.0}





The comprehension unpacks each tuple returned by grades.items() into k (the name) and v (the list of grades). Then, the comprehension creates a new key-value pair with the key k and the value of sum(v) / len(v), which averages the list's elements.


6.3 SETS 


A set is an unordered collection of unique values. Sets may contain only immutable objects, like strings, ints, floats and tuples that contain only immutable elements. Though sets are iterable, they are not sequences and do not support indexing and slicing with square brackets, []. Dictionaries also do not support slicing.


Creating a Set with Curly Braces 


The following code creates a set of strings named colors:

In [1]: colors = {'red', 'orange', 'yellow', 'green', 'red', 'blue'}

In [2]: colors
Out[2]: {'blue', 'green', 'orange', 'red', 'yellow'}

Notice that the duplicate string 'red' was ignored (without causing an error). An important use of sets is duplicate elimination, which is automatic when creating a set. Also, the resulting set's values are not displayed in the same order as they were listed in snippet [1]. Though the color names are displayed in sorted order, sets are unordered. You should not write code that depends on the order of their elements.


Determining a Set’s Length 


You can determine the number of items in a set with the built-in len function: 


In [3]: len(colors)
Out[3]: 5


Checking Whether a Value Is in a Set 


You can check whether a set contains a particular value using the in and not in operators:

In [4]: 'red' in colors
Out[4]: True

In [5]: 'purple' in colors
Out[5]: False

In [6]: 'purple' not in colors
Out[6]: True





Iterating Through a Set 





Sets are iterable, so you can process each set element with a for loop:

In [7]: for color in colors:
   ...:     print(color.upper(), end=' ')
   ...:
RED GREEN YELLOW BLUE ORANGE

Sets are unordered, so there's no significance to the iteration order.


Creating a Set with the Built-In set Function 


You can create a set from another collection of values by using the built-in set function. Here we create a list that contains several duplicate integer values and use that list as set's argument:

In [8]: numbers = list(range(10)) + list(range(5))

In [9]: numbers
Out[9]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4]

In [10]: set(numbers)
Out[10]: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

If you need to create an empty set, you must use the set function with empty parentheses, rather than empty braces, {}, which represent an empty dictionary:

In [11]: set()
Out[11]: set()

Python displays an empty set as set() to avoid confusion with Python's string representation of an empty dictionary ({}).


Frozenset: An Immutable Set Type 


Sets are mutable (you can add and remove elements), but set elements must be immutable. Therefore, a set cannot have other sets as elements. A frozenset is an immutable set. It cannot be modified after you create it, so a set can contain frozensets as elements. The built-in function frozenset creates a frozenset from any iterable.
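
A short sketch of these rules follows. The variable names (evens, odds, groups, nested) are illustrative assumptions:

```python
# frozenset accepts any iterable and eliminates duplicates, like set.
evens = frozenset([2, 4, 6, 4])
odds = frozenset({1, 3, 5})

# frozensets are immutable and hashable, so a set may contain them.
groups = {evens, odds}

# Ordinary sets are mutable and unhashable, so nesting them fails.
try:
    nested = {{1, 2}, {3, 4}}
except TypeError:
    nested = None

print(groups)
print(nested)
print(hasattr(evens, 'add'))  # frozensets have no mutating methods
```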


6.3.1 Comparing Sets 


Various operators and methods can be used to compare sets. The following sets contain the same values, so == returns True and != returns False.

In [1]: {1, 3, 5} == {3, 5, 1}
Out[1]: True

In [2]: {1, 3, 5} != {3, 5, 1}
Out[2]: False

The < operator tests whether the set to its left is a proper subset of the one to its right; that is, all the elements in the left operand are in the right operand, and the sets are not equal:

In [3]: {1, 3, 5} < {3, 5, 1}
Out[3]: False

In [4]: {1, 3, 5} < {7, 3, 5, 1}
Out[4]: True

The <= operator tests whether the set to its left is an improper subset of the one to its right; that is, all the elements in the left operand are in the right operand, and the sets might be equal:

In [5]: {1, 3, 5} <= {3, 5, 1}
Out[5]: True

In [6]: {1, 3} <= {3, 5, 1}
Out[6]: True

You may also check for an improper subset with the set method issubset:

In [7]: {1, 3, 5}.issubset({3, 5, 1})
Out[7]: True

In [8]: {1, 2}.issubset({3, 5, 1})
Out[8]: False


The > operator tests whether the set to its left is a proper superset of the one to its right; that is, all the elements in the right operand are in the left operand, and the left operand has more elements:

In [9]: {1, 3, 5} > {3, 5, 1}
Out[9]: False

In [10]: {1, 3, 5, 7} > {3, 5, 1}
Out[10]: True

The >= operator tests whether the set to its left is an improper superset of the one to its right; that is, all the elements in the right operand are in the left operand, and the sets might be equal:

In [11]: {1, 3, 5} >= {3, 5, 1}
Out[11]: True

In [12]: {1, 3, 5} >= {3, 1}
Out[12]: True

In [13]: {1, 3} >= {3, 1, 7}
Out[13]: False

You may also check for an improper superset with the set method issuperset:

In [14]: {1, 3, 5}.issuperset({3, 5, 1})
Out[14]: True

In [15]: {1, 3, 5}.issuperset({3, 2})
Out[15]: False

The argument to issubset or issuperset can be any iterable. When either of these methods receives a non-set iterable argument, it first converts the iterable to a set, then performs the operation.


6.3.2 Mathematical Set Operations 


This section presents the set type's mathematical operators |, &, - and ^ and the corresponding methods.


Union 


The union of two sets is a set consisting of all the unique elements from both sets. You can calculate the union with the | operator or with the set type's union method:

In [1]: {1, 3, 5} | {2, 3, 4}
Out[1]: {1, 2, 3, 4, 5}

In [2]: {1, 3, 5}.union([20, 20, 3, 40, 40])
Out[2]: {1, 3, 5, 20, 40}

The operands of the binary set operators, like |, must both be sets. The corresponding set methods may receive any iterable object as an argument; we passed a list. When a mathematical set method receives a non-set iterable argument, it first converts the iterable to a set, then applies the mathematical operation. Again, though the new sets' string representations show the values in ascending order, you should not write code that depends on this.
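
The difference between the operator and the method can be sketched as follows. The variable names are illustrative assumptions; the operator form raises a TypeError when its right operand is a list, while the method form accepts it:

```python
# The | operator requires both operands to be sets.
try:
    via_operator = {1, 3, 5} | [2, 3, 4]
except TypeError:
    via_operator = None  # a list is not an acceptable operand

# The union method accepts any iterable.
via_method = {1, 3, 5}.union([2, 3, 4])

# Converting the list explicitly makes the operator form work too.
via_conversion = {1, 3, 5} | set([2, 3, 4])

print(via_operator)
print(via_method == via_conversion)
```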


Intersection 


The intersection of two sets is a set consisting of all the unique elements that the two sets have in common. You can calculate the intersection with the & operator or with the set type's intersection method:

In [3]: {1, 3, 5} & {2, 3, 4}
Out[3]: {3}

In [4]: {1, 3, 5}.intersection([1, 2, 2, 3, 3, 4, 4])
Out[4]: {1, 3}


Difference 


The difference between two sets is a set consisting of the elements in the left operand that are not in the right operand. You can calculate the difference with the - operator or with the set type's difference method:

In [5]: {1, 3, 5} - {2, 3, 4}
Out[5]: {1, 5}

In [6]: {1, 3, 5, 7}.difference([2, 2, 3, 3, 4, 4])
Out[6]: {1, 5, 7}


Symmetric Difference 


The symmetric difference between two sets is a set consisting of the elements of both sets that are not in common with one another. You can calculate the symmetric difference with the ^ operator or with the set type's symmetric_difference method:

In [7]: {1, 3, 5} ^ {2, 3, 4}
Out[7]: {1, 2, 4, 5}

In [8]: {1, 3, 5, 7}.symmetric_difference([2, 2, 3, 3, 4, 4])
Out[8]: {1, 2, 4, 5, 7}

Disjoint


Two sets are disjoint if they do not have any common elements. You can determine this with the set type's isdisjoint method:

In [9]: {1, 3, 5}.isdisjoint({2, 4, 6})
Out[9]: True

In [10]: {1, 3, 5}.isdisjoint({4, 6, 1})
Out[10]: False


6.3.3 Mutable Set Operators and Methods 


The operators and methods presented in the preceding section each result in a new set. Here we discuss operators and methods that modify an existing set.


Mutable Mathematical Set Operations 


Like operator |, union augmented assignment |= performs a set union operation, but |= modifies its left operand:

In [1]: numbers = {1, 3, 5}

In [2]: numbers |= {2, 3, 4}

In [3]: numbers
Out[3]: {1, 2, 3, 4, 5}


Similarly, the set type's update method performs a union operation modifying the set on which it's called. The argument can be any iterable:

In [4]: numbers.update(range(10))

In [5]: numbers
Out[5]: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}


The other mutable set operators are:

• intersection augmented assignment &=

• difference augmented assignment -=

• symmetric difference augmented assignment ^=

and their corresponding methods with iterable arguments are:

• intersection_update

• difference_update

• symmetric_difference_update
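
The following sketch applies each of these in turn to a single set. The starting values and the snapshot variable names are illustrative assumptions:

```python
numbers = {1, 2, 3, 4, 5, 6}

# Keep only the elements also found in the iterable argument.
numbers.intersection_update([2, 3, 4, 5, 10])
after_intersection = set(numbers)

# Remove every element found in the iterable argument.
numbers.difference_update(range(5, 8))
after_difference = set(numbers)

# Keep the elements that appear in exactly one of the two collections.
numbers.symmetric_difference_update([4, 8])
after_symmetric = set(numbers)

# The augmented assignment form requires a set on the right.
numbers ^= {8, 9}

print(after_intersection, after_difference, after_symmetric, numbers)
```

Each call modifies numbers in place and returns None, which is why the snapshots are copied with set(numbers) rather than taken from the calls' results.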


Methods for Adding and Removing Elements 


Set method add inserts its argument if the argument is not already in the set; otherwise, the set remains unchanged:

In [6]: numbers.add(17)

In [7]: numbers.add(3)

In [8]: numbers
Out[8]: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 17}





Set method remove removes its argument from the set. A KeyError occurs if the value is not in the set:

In [9]: numbers.remove(3)

In [10]: numbers
Out[10]: {0, 1, 2, 4, 5, 6, 7, 8, 9, 17}


Method discard also removes its argument from the set but does not cause an exception if the value is not in the set.
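
A minimal sketch contrasting the two methods, using an illustrative set and a flag variable named removed:

```python
numbers = {1, 2, 3}

# discard silently does nothing when the value is absent.
numbers.discard(10)

# remove raises KeyError for the same absent value.
try:
    numbers.remove(10)
except KeyError:
    removed = False
else:
    removed = True

print(numbers)
print(removed)
```

Use remove when a missing value indicates a bug you want to hear about, and discard when absence is acceptable.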


You also can remove an arbitrary set element and return it with pop, but sets are unordered, so you do not know which element will be returned:

In [11]: numbers.pop()
Out[11]: 0

In [12]: numbers
Out[12]: {1, 2, 4, 5, 6, 7, 8, 9, 17}


A KeyError occurs if the set is empty when you call pop.

Finally, method clear empties the set on which it's called:

In [13]: numbers.clear()

In [14]: numbers
Out[14]: set()


6.3.4 Set Comprehensions 


Like dictionary comprehensions, you define set comprehensions in curly braces. Let's create a new set containing only the unique even values in the list numbers:

In [1]: numbers = [1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 10]

In [2]: evens = {item for item in numbers if item % 2 == 0}

In [3]: evens
Out[3]: {2, 4, 6, 8, 10}





6.4 INTRO TO DATA SCIENCE: DYNAMIC 
VISUALIZATIONS 


The preceding chapter's Intro to Data Science section introduced visualization. We simulated rolling a six-sided die and used the Seaborn and Matplotlib visualization libraries to create a publication-quality static bar plot showing the frequencies and percentages of each roll value. In this section, we make things "come alive" with dynamic visualizations.


The Law of Large Numbers 


When we introduced random-number generation, we mentioned that if the random module's randrange function indeed produces integers at random, then every number in the specified range has an equal probability (or likelihood) of being chosen each time the function is called. For a six-sided die, each value 1 through 6 should occur one-sixth of the time, so the probability of any one of these values occurring is 1/6th or about 16.667%.


In the next section, we create and execute a dynamic (that is, animated) die-rolling simulation script. In general, you'll see that the more rolls we attempt, the closer each die value's percentage of the total rolls gets to 16.667% and the heights of the bars gradually become about the same. This is a manifestation of the law of large numbers.
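
The effect can be previewed without any plotting. The sketch below is a text-only illustration, not the animated script; the function name face_percentages, the seed value and the roll counts are assumptions chosen for this example. It reports how far the least accurate face's percentage lies from 16.667% as the number of rolls grows:

```python
import random


def face_percentages(rolls, seed=None):
    """Hypothetical helper: roll a die `rolls` times, return % per face."""
    rng = random.Random(seed)  # separate generator so runs are repeatable
    counts = {face: 0 for face in range(1, 7)}
    for _ in range(rolls):
        counts[rng.randrange(1, 7)] += 1
    return {face: 100 * count / rolls for face, count in counts.items()}


expected = 100 / 6  # about 16.667%
for rolls in (60, 6_000, 600_000):
    percentages = face_percentages(rolls, seed=1)
    worst = max(abs(p - expected) for p in percentages.values())
    print(f'{rolls:>7,} rolls: largest deviation {worst:.3f} percentage points')
```

With more rolls, the largest deviation typically shrinks, which is the behavior the bar heights in the animation make visible.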


6.4.1 How Dynamic Visualization Works 


The plots produced with Seaborn and Matplotlib in the previous chapter's Intro to Data Science section help you analyze the results for a fixed number of die rolls after the simulation completes. This section enhances that code with the Matplotlib animation module's FuncAnimation function, which updates the bar plot dynamically. You'll see the bars, die frequencies and percentages "come alive," updating continuously as the rolls occur.


Animation Frames 


FuncAnimation drives a frame-by-frame animation. Each animation frame specifies everything that should change during one plot update. Stringing together many of these updates over time creates the animation effect. You decide what each frame displays with a function you define and pass to FuncAnimation.


Each animation frame will:

• roll the die a specified number of times (from 1 to as many as you'd like), updating die frequencies with each roll,

• clear the current plot,

• create a new set of bars representing the updated frequencies, and

• create new frequency and percentage text for each bar.


Generally, displaying more frames-per-second yields smoother animation. For example, video games with fast-moving elements try to display at least 30 frames-per-second and often more. Though you'll specify the number of milliseconds between animation frames, the actual number of frames-per-second can be affected by the amount of work you perform in each frame and the speed of your computer's processor. This example displays an animation frame every 33 milliseconds, yielding approximately 30 (1000 / 33) frames-per-second. Try larger and smaller values to see how they affect the animation. Experimentation is important in developing the best visualizations.


Running RollDieDynamic.py


In the previous chapter's Intro to Data Science section, we developed the static visualization interactively so you could see how the code updates the bar plot as you execute each statement. The actual bar plot with the final frequencies and percentages was drawn only once.


For this dynamic visualization, the screen results update frequently so that you can see the animation. Many things change continuously: the lengths of the bars, the frequencies and percentages above the bars, the spacing and labels on the axes and the total number of die rolls shown in the plot's title. For this reason, we present this visualization as a script, rather than interactively developing it.


The script takes two command-line arguments:

• number_of_frames: The number of animation frames to display. This value determines the total number of times that FuncAnimation updates the graph. For each animation frame, FuncAnimation calls a function that you define (in this example, update) to specify how to change the plot.

• rolls_per_frame: The number of times to roll the die in each animation frame. We'll use a loop to roll the die this number of times, summarize the results, then update the graph with bars and text representing the new frequencies.


To understand how we use these two values, consider the following command: 


ipython RollDieDynamic.py 6000 1 


In this case, FuncAnimation calls our update function 6000 times, rolling one die per frame for a total of 6000 rolls. This enables you to see the bars, frequencies and percentages update one roll at a time. On our system, this animation took about 3.33 minutes (6000 frames / 30 frames-per-second / 60 seconds-per-minute) to show you only 6000 die rolls.


Displaying animation frames to the screen is a relatively slow input/output-bound operation compared to the die rolls, which occur at the computer's super fast CPU speeds. If we roll only one die per animation frame, we won't be able to run a large number of rolls in a reasonable amount of time. Also, for small numbers of rolls, you're unlikely to see the die percentages converge on their expected 16.667% of the total rolls.


To see the law of large numbers in action, you can increase the execution speed by rolling the die more times per animation frame. Consider the following command:


ipython RollDieDynamic.py 10000 600 


In this case, FuncAnimation will call our update function 10,000 times, performing 
600 rolls-per-frame for a total of 6,000,000 rolls. On our system, this took about 5.55 
minutes (10,000 frames / 30 frames-per-second / 60 seconds-per-minute), but 
displayed approximately 18,000 rolls-per-second (30 frames-per-second * 600 rolls- 
per-frame), so we could quickly see the frequencies and percentages converge on their 


expected values of about 1,000,000 rolls per face and 16.667% per face. 
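
The arithmetic behind both commands can be checked with a short sketch. The function name animation_stats is hypothetical, and the default of 30 frames per second is the rate this section's timing figures assume; your system's actual frame rate may differ.

```python
def animation_stats(frames, rolls_per_frame, fps=30):
    """Return (total rolls, running time in minutes, rolls per second).

    fps=30 is the frame rate assumed in the text, not a measured value.
    """
    total_rolls = frames * rolls_per_frame
    minutes = frames / fps / 60
    rolls_per_second = fps * rolls_per_frame
    return total_rolls, minutes, rolls_per_second


# the two command lines discussed above
for frames, rolls in [(6000, 1), (10000, 600)]:
    total, minutes, rate = animation_stats(frames, rolls)
    print(f'{frames:,} frames x {rolls} rolls per frame: '
          f'{total:,} rolls in {minutes:.2f} minutes ({rate:,} rolls/second)')
```

Increasing rolls per frame raises the roll rate without lengthening the animation, because the running time depends only on the number of frames.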


Experiment with the numbers of rolls and frames until you feel that the program is 
helping you visualize the results most effectively. It’s fun and informative to watch it 


run and to tweak it until you’re satisfied with the animation quality. 


Sample Executions 


We took the following four screen captures, two during each of two sample executions. In
the first execution, the screens show the graph after just 64 die rolls, then again after
604 of the 6000 total die rolls. Run this script live to see over time how the bars update
dynamically. In the second execution, the screen captures show the graph after 7200
die rolls and again after 166,200 out of the 6,000,000 rolls. With more rolls, you can
see the percentages closing in on their expected values of 16.667% as predicted by the
law of large numbers.
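
You can also watch the percentages converge without any graphics. The following is a minimal sketch rather than part of RollDieDynamic.py. The function name face_percentages, the seed parameter and the roll counts are assumptions chosen for illustration.

```python
import random


def face_percentages(total_rolls, seed=None):
    """Roll a six-sided die total_rolls times; return each face's share."""
    rng = random.Random(seed)  # a seeded generator makes runs repeatable
    frequencies = [0] * 6
    for _ in range(total_rolls):
        frequencies[rng.randrange(1, 7) - 1] += 1  # faces 1-6 -> indices 0-5
    return [frequency / total_rolls for frequency in frequencies]


for total_rolls in (60, 6_000, 600_000):
    percentages = face_percentages(total_rolls, seed=1)
    largest_gap = max(abs(p - 1 / 6) for p in percentages)
    print(f'{total_rolls:>7,} rolls: largest gap from 16.667% is '
          f'{largest_gap:.3%}')
```

The largest gap should generally shrink as the number of rolls grows, which is the behavior the animation makes visible.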


Execute 6000 animation frames rolling the die once per frame: 
ipython RollDieDynamic.py 6000 1 

Execute 10,000 animation frames rolling the die 600 times per frame:
ipython RollDieDynamic.py 10000 600

[Screen capture: bar plot titled "Die Frequencies for 7,200 Rolls"]

6.4.2 Implementing a Dynamic Visualization


The script we present in this section uses the same Seaborn and Matplotlib features 
shown in the previous chapter’s Intro to Data Science section. We reorganized the code 


for use with Matplotlib’s animation capabilities. 


Importing the Matplotlib animation Module 


We focus primarily on the new features used in this example. Line 3 imports the
Matplotlib animation module.

1 # RollDieDynamic.py
2 """Dynamically graphing frequencies of die rolls."""
3 from matplotlib import animation
4 import matplotlib.pyplot as plt
5 import random
6 import seaborn as sns
7 import sys
8


Function update 


Lines 9-27 define the update function that FuncAnimation calls once per animation
frame. This function must have at least one parameter. Lines 9-10 show the
beginning of the function definition. The parameters are:

• frame_number: The next value from FuncAnimation’s frames argument,
which we'll discuss momentarily. Though FuncAnimation requires the update
function to have this parameter, we do not use it in this update function.

• rolls: The number of die rolls per animation frame.

• faces: The die face values used as labels along the graph’s x-axis.

• frequencies: The list in which we summarize the die frequencies.


We discuss the rest of the function’s body in the next several subsections. 


9 def update(frame_number, rolls, faces, frequencies):
10     """Configures bar plot contents for each animation frame."""


Function update: Rolling the Die and Updating the frequencies List 





Lines 12-13 roll the die rolls times and increment the appropriate frequencies
element for each roll. Note that we subtract 1 from the die value (1 through 6) before
incrementing the corresponding frequencies element. As you'll see, frequencies
is a six-element list (defined in line 36), so its indices are 0 through 5.


11     # roll die and update frequencies
12     for i in range(rolls):
13         frequencies[random.randrange(1, 7) - 1] += 1
14


Function update: Configuring the Bar Plot and Text 


Line 16 in function update calls the matplotlib.pyplot module’s cla (clear axes)
function to remove the existing bar plot elements before drawing new ones for the
current animation frame. We discussed the code in lines 17-27 in the previous
chapter’s Intro to Data Science section. Lines 17-20 create the bars, set the bar plot’s
title, set the x- and y-axis labels and scale the plot to make room for the frequency and
percentage text above each bar. Lines 23-27 display the frequency and percentage text.


15     # reconfigure plot for updated die frequencies
16     plt.cla()  # clear old contents of current Figure
17     axes = sns.barplot(x=faces, y=frequencies, palette='bright')  # new bars
18     axes.set_title(f'Die Frequencies for {sum(frequencies):,} Rolls')
19     axes.set(xlabel='Die Value', ylabel='Frequency')
20     axes.set_ylim(top=max(frequencies) * 1.10)  # scale y-axis by 10%
21
22     # display frequency & percentage above each patch (bar)
23     for bar, frequency in zip(axes.patches, frequencies):
24         text_x = bar.get_x() + bar.get_width() / 2.0
25         text_y = bar.get_height()
26         text = f'{frequency:,}\n{frequency / sum(frequencies):.3%}'
27         axes.text(text_x, text_y, text, ha='center', va='bottom')
28


























Variables Used to Configure the Graph and Maintain State 


Lines 30 and 31 use the sys module’s argv list to get the script’s command-line
arguments. Line 33 specifies the Seaborn 'whitegrid' style. Line 34 calls the
matplotlib.pyplot module’s figure function to get the Figure object in which
FuncAnimation displays the animation. The function’s argument is the window’s title.
As you'll soon see, this is one of FuncAnimation’s required arguments. Line 35 creates
a list containing the die face values 1-6 to display on the plot’s x-axis. Line 36 creates
the six-element frequencies list with each element initialized to 0. We update this
list’s counts with each die roll.


29 # read command-line arguments for number of frames and rolls per frame
30 number_of_frames = int(sys.argv[1])
31 rolls_per_frame = int(sys.argv[2])
32
33 sns.set_style('whitegrid')  # white background with gray grid lines
34 figure = plt.figure('Rolling a Six-Sided Die')  # Figure for animation
35 values = list(range(1, 7))  # die faces for display on x-axis
36 frequencies = [0] * 6  # six-element list of die frequencies
37








Calling the animation Module’s FuncAnimation Function


Lines 39-41 call the Matplotlib animation module’s FuncAnimation function to 
update the bar chart dynamically. The function returns an object representing the 
animation. Though this is not used explicitly, you must store the reference to the 
animation; otherwise, Python immediately terminates the animation and returns its 


memory to the system. 


38 # configure and start animation that calls function update
39 die_animation = animation.FuncAnimation(
40     figure, update, repeat=False, frames=number_of_frames, interval=33,
41     fargs=(rolls_per_frame, values, frequencies))
42
43 plt.show()  # display window











FuncAnimation has two required arguments:

• figure: the Figure object in which to display the animation, and

• update: the function to call once per animation frame.

In this case, we also pass the following optional keyword arguments:

• repeat: False terminates the animation after the specified number of frames. If
True (the default), when the animation completes it restarts from the beginning.

• frames: The total number of animation frames, which controls how many times
FuncAnimation calls update. Passing an integer is equivalent to passing a
range. For example, 600 means range(600). FuncAnimation passes one value
from this range as the first argument in each call to update.

• interval: The number of milliseconds (33, in this case) between animation
frames (the default is 200). After each call to update, FuncAnimation waits 33
milliseconds before making the next call.

• fargs (short for “function arguments”): A tuple of other arguments to pass to the
function you specified in FuncAnimation’s second argument. The arguments you
specify in the fargs tuple correspond to update’s parameters rolls, faces and
frequencies (line 9).
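
To make the roles of frames and fargs concrete, here is a plain-Python sketch of the calling convention only. The helper run_frames is hypothetical and is not how Matplotlib implements FuncAnimation. It omits the timer, the drawing and every other detail, and this simplified update only counts rolls.

```python
import random


def run_frames(func, frames, fargs=()):
    """Hypothetical stand-in: call func(frame_value, *fargs) once per frame."""
    frame_values = range(frames) if isinstance(frames, int) else frames
    for frame_number in frame_values:
        func(frame_number, *fargs)  # frame value first, then the fargs items


def update(frame_number, rolls, faces, frequencies):
    """Simplified update that rolls the die but draws nothing."""
    for _ in range(rolls):
        frequencies[random.randrange(1, 7) - 1] += 1


values = list(range(1, 7))
frequencies = [0] * 6
run_frames(update, frames=100, fargs=(5, values, frequencies))
print(sum(frequencies))  # 100 frames * 5 rolls per frame = 500
```

Because frequencies is a list, each call mutates the same object, which is how the state persists from one frame to the next.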


For a list of FuncAnimation’s other optional arguments, see

https://matplotlib.org/api/_as_gen/matplotlib.animation.FuncAnimation.html








Finally, line 43 displays the window. 


6.5 WRAP-UP 


In this chapter, we discussed Python’s dictionary and set collections. We said what a
dictionary is and presented several examples. We showed the syntax of key-value pairs
and showed how to use them to create dictionaries with comma-separated lists of
key-value pairs in curly braces, {}. You also created dictionaries with dictionary
comprehensions.

You used square brackets, [], to retrieve the value corresponding to a key, and to insert
and update key-value pairs. You also used the dictionary method update to change a
key’s associated value. You iterated through a dictionary’s keys, values and items.

You created sets of unique immutable values. You compared sets with the comparison
operators, combined sets with set operators and methods, changed sets’ values with the
mutable set operations and created sets with set comprehensions. You saw that sets are
mutable. Frozensets are immutable, so they can be used as set and frozenset elements.


In the Intro to Data Science section, we continued our visualization introduction by
presenting the die-rolling simulation with a dynamic bar plot to make the law of large
numbers “come alive.” In addition to the Seaborn and Matplotlib features shown in the
previous chapter’s Intro to Data Science section, we used Matplotlib’s FuncAnimation
function to control a frame-by-frame animation. FuncAnimation called a function we
defined that specified what to display in each animation frame.

In the next chapter, we discuss array-oriented programming with the popular NumPy
library. As you'll see, NumPy’s ndarray collection can be up to two orders of
magnitude faster than performing many of the same operations with Python’s built-in
lists. This power will come in handy for today’s big data applications.


7. Array-Oriented Programming with NumPy


Objectives 

In this chapter you'll: 

■ Learn how arrays differ from lists.
■ Use the numpy module’s high-performance ndarrays.
■ Compare list and ndarray performance with the IPython %timeit magic.
■ Use ndarrays to store and retrieve data efficiently.
■ Create and initialize ndarrays.
■ Refer to individual ndarray elements.
■ Iterate through ndarrays.
■ Create and manipulate multidimensional ndarrays.
■ Perform common ndarray manipulations.
■ Create and manipulate pandas one-dimensional Series and two-dimensional DataFrames.
■ Customize Series and DataFrame indices.
■ Calculate basic descriptive statistics for data in a Series and a DataFrame.
■ Customize floating-point number precision in pandas output formatting.


7.1 INTRODUCTION


The NumPy (Numerical Python) library first appeared in 2006 and is the preferred
Python array implementation. It offers a high-performance, richly functional
n-dimensional array type called ndarray, which from this point forward we'll refer to by
its synonym, array. NumPy is one of the many open-source libraries that the
Anaconda Python distribution installs. Operations on arrays are up to two orders of
magnitude faster than those on lists. In a big-data world in which applications may do
massive amounts of processing on vast amounts of array-based data, this performance
advantage can be critical. According to Libraries.io, over 450 Python libraries
depend on NumPy. Many popular data science libraries such as Pandas, SciPy
(Scientific Python) and Keras (for deep learning) are built on or depend on NumPy.


In this chapter, we explore array’s basic capabilities. Lists can have multiple 


dimensions. You generally process multi-dimensional lists with nested loops or list 





comprehensions with multiple for clauses. A strength of NumPy is “array-oriented 
programming,” which uses functional-style programming with internal iteration to 
make array manipulations concise and straightforward, eliminating the kinds of bugs 


that can occur with the external iteration of explicitly programmed loops. 


In this chapter’s Intro to Data Science section, we begin our multi-section introduction 
to the pandas library that you'll use in many of the data science case study chapters. Big 
data applications often need more flexible collections than NumPy’s arrays— 
collections that support mixed data types, custom indexing, missing data, data that’s 
not structured consistently and data that needs to be manipulated into forms 
appropriate for the databases and data analysis packages you use. We'll introduce 
pandas array-like one-dimensional Series and two-dimensional DataFrames and 
begin demonstrating their powerful capabilities. After reading this chapter, you'll be 
familiar with four array-like collections—lists, arrays, Series and DataFrames. 


We'll add a fifth—tensors—in the “Deep Learning” chapter. 


7.2 CREATING ARRAYS FROM EXISTING DATA 


The NumPy documentation recommends importing the numpy module as np so that 


you can access its members with "np.": 


In [1]: import numpy as np 


The numpy module provides various functions for creating arrays. Here we use the 
array function, which receives as an argument an array or other collection of 
elements and returns a new array containing the argument’s elements. Let’s pass a 


list: 


In [2]: numbers = np.array([2, 3, 5, 7, 11])


The array function copies its argument’s contents into the array. Let’s look at the 


type of object that function array returns and display its contents: 


In [3]: type(numbers)
Out[3]: numpy.ndarray

In [4]: numbers
Out[4]: array([ 2,  3,  5,  7, 11])


Note that the type is numpy.ndarray, but all arrays are output as “array.” When
outputting an array, NumPy separates each value from the next with a comma and a
space and right-aligns all the values using the same field width. It determines the field
width based on the value that occupies the largest number of character positions. In
this case, the value 11 occupies two character positions, so all the values are
formatted in two-character fields. That’s why there’s a leading space between the [ and
the 2.
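
The field-width rule can be reproduced by hand for this one-dimensional integer array. This sketch only mimics the displayed text for this particular case and is not NumPy's actual formatting code. The names values, width and manual are chosen for illustration.

```python
import numpy as np

numbers = np.array([2, 3, 5, 7, 11])

values = numbers.tolist()  # plain Python ints
width = max(len(str(value)) for value in values)  # widest value: 11 -> 2

# right-align every value in a field of that width, separated by ', '
manual = 'array([' + ', '.join(f'{value:>{width}}' for value in values) + '])'

print(manual)
print(repr(numbers))
```

Both lines print array([ 2,  3,  5,  7, 11]), with the single-digit values padded to match the two-character width of 11.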


Multidimensional Arguments 


The array function copies its argument’s dimensions. Let’s create an array from a
two-row-by-three-column list:

In [5]: np.array([[1, 2, 3], [4, 5, 6]])
Out[5]:
array([[1, 2, 3],
       [4, 5, 6]])


NumPy auto-formats arrays, based on their number of dimensions, aligning the 


columns within each row. 


7.3 ARRAY ATTRIBUTES 


An array object provides attributes that enable you to discover information about its 


structure and contents. In this section we'll use the following arrays: 


In [1]: import numpy as np

In [2]: integers = np.array([[1, 2, 3], [4, 5, 6]])

In [3]: integers
Out[3]:
array([[1, 2, 3],
       [4, 5, 6]])

In [4]: floats = np.array([0.0, 0.1, 0.2, 0.3, 0.4])

In [5]: floats
Out[5]: array([0. , 0.1, 0.2, 0.3, 0.4])

NumPy does not display trailing 0s to the right of the decimal point in floating-point
values.


Determining an array’s Element Type 


The array function determines an array’s element type from its argument’s elements. 


You can check the element type with an array’s dtype attribute: 


In [6]: integers.dtype
Out[6]: dtype('int64')  # int32 on some platforms

In [7]: floats.dtype
Out[7]: dtype('float64')


As you'll see in the next section, various array-creation functions receive a dtype 


keyword argument so you can specify an array’s element type. 


For performance reasons, NumPy is written in the C programming language and uses
C’s data types. By default, NumPy stores integers as NumPy type int64 values,
which correspond to 64-bit (8-byte) integers in C, and stores floating-point numbers
as NumPy type float64 values, which correspond to 64-bit (8-byte) floating-point
values in C. In our examples, most commonly you'll see the types int64,
float64, bool (for Boolean) and object for non-numeric data (such as strings). The
complete list of supported types is at
https://docs.scipy.org/doc/numpy/user/basics.types.html.


Determining an array’s Dimensions 


The attribute ndim contains an array’s number of dimensions and the attribute 


shape contains a tuple specifying an array’s dimensions: 


In [8]: integers.ndim
Out[8]: 2

In [9]: floats.ndim
Out[9]: 1

In [10]: integers.shape
Out[10]: (2, 3)

In [11]: floats.shape
Out[11]: (5,)





Here, integers has 2 rows and 3 columns (6 elements) and floats is
one-dimensional, so snippet [11] shows a one-element tuple (indicated by the comma)
containing floats’ number of elements (5).





Determining an array’s Number of Elements and Element Size 


You can view an array’s total number of elements with the attribute size and the 


number of bytes required to store each element with itemsize: 


In [12]: integers.size
Out[12]: 6

In [13]: integers.itemsize  # 4 if C compiler uses 32-bit ints
Out[13]: 8

In [14]: floats.size
Out[14]: 5

In [15]: floats.itemsize
Out[15]: 8





Note that integers’ size is the product of the shape tuple’s values—two rows of 


three elements each for a total of six elements. In each case, itemsize is 8 because 





integers contains int64 values and floats contains float64 values, which each 


occupy 8 bytes. 
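
The relationships among these attributes can be verified directly. The sketch below also uses the nbytes attribute, which this section does not cover, to show the total storage for the elements. We pass dtype=np.int64 explicitly so the result does not depend on the platform's default integer size.

```python
import math

import numpy as np

integers = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)
floats = np.array([0.0, 0.1, 0.2, 0.3, 0.4])

for name, array in (('integers', integers), ('floats', floats)):
    # size is the product of the shape tuple's values
    assert array.size == math.prod(array.shape)
    # total bytes for the elements is size * itemsize
    assert array.nbytes == array.size * array.itemsize
    print(f'{name}: ndim={array.ndim}, shape={array.shape}, '
          f'size={array.size}, itemsize={array.itemsize}, '
          f'nbytes={array.nbytes}')
```

With 8-byte elements, integers occupies 48 bytes and floats occupies 40 bytes of element storage.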


Iterating Through a Multidimensional array’s Elements


You'll generally manipulate arrays using concise functional-style programming 
techniques. However, because arrays are iterable, you can use external iteration if 


you'd like: 


In [16]: for row in integers:
    ...:     for column in row:
    ...:         print(column, end='  ')
    ...:     print()
    ...:


You can iterate through a multidimensional array as if it were one-dimensional by 


using its flat attribute: 


In [17]: for i in integers.flat:
    ...:     print(i, end='  ')
    ...:


7.4 FILLING ARRAYS WITH SPECIFIC VALUES 


NumPy provides functions zeros, ones and full for creating arrays containing 0s, 
1s or a specified value, respectively. By default, zeros and ones create arrays 


containing float64 values. We’ll show how to customize the element type 





momentarily. The first argument to these functions must be an integer or a tuple of 
integers specifying the desired dimensions. For an integer, each function returns a one- 


dimensional array with the specified number of elements: 


In [1]: import numpy as np

In [2]: np.zeros(5)
Out[2]: array([0., 0., 0., 0., 0.])


For a tuple of integers, these functions return a multidimensional array with the
specified dimensions. You can specify the array’s element type with the zeros and
ones functions’ dtype keyword argument:

In [3]: np.ones((2, 4), dtype=int)


The array returned by full contains elements with the second argument’s value and 


type: 





In [4]: np.full((3, 5), 13)
Out[4]:
array([[13, 13, 13, 13, 13],
       [13, 13, 13, 13, 13],
       [13, 13, 13, 13, 13]])

7.5 CREATING ARRAYS FROM RANGES 


NumPy provides optimized functions for creating arrays from ranges. We focus on
simple evenly spaced integer and floating-point ranges, but NumPy also supports
nonlinear ranges. *

* https://docs.scipy.org/doc/numpy/reference/routines.array-creation.html.


Creating Integer Ranges with arange 


Let’s use NumPy’s arange function to create integer ranges—similar to using built-in 
function range. In each case, arange first determines the resulting array’s number 
of elements, allocates the memory, then stores the specified range of values in the 


array: 


In [1]: import numpy as np

In [2]: np.arange(5)
Out[2]: array([0, 1, 2, 3, 4])

In [3]: np.arange(5, 10)
Out[3]: array([5, 6, 7, 8, 9])

In [4]: np.arange(10, 1, -2)
Out[4]: array([10,  8,  6,  4,  2])





Though you can create arrays by passing ranges as arguments, always use arange 
as it’s optimized for arrays. Soon we'll show how to determine the execution time of 


various operations so you can compare their performance. 


Creating Floating-Point Ranges with linspace


You can produce evenly spaced floating-point ranges with NumPy’s linspace 
function. The function’s first two arguments specify the starting and ending values in 
the range, and the ending value is included in the array. The optional keyword 
argument num specifies the number of evenly spaced values to produce—this 


argument’s default value is 50: 


In [5]: np.linspace(0.0, 1.0, num=5)
Out[5]: array([0.  , 0.25, 0.5 , 0.75, 1.  ])
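
Because linspace includes the ending value, num values are separated by num - 1 equal steps. The sketch below rebuilds the same values by hand to show where 0.25 comes from. The names start, stop, step and by_hand are chosen for illustration.

```python
import numpy as np

start, stop, num = 0.0, 1.0, 5

# num values including both endpoints means num - 1 equal steps
step = (stop - start) / (num - 1)
by_hand = np.array([start + i * step for i in range(num)])

evenly_spaced = np.linspace(start, stop, num=num)

print(evenly_spaced)
print(np.allclose(by_hand, evenly_spaced))  # same values either way
print(len(np.linspace(start, stop)))        # num defaults to 50
```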


Reshaping an array 


You also can create an array from a range of elements, then use array method 
reshape to transform the one-dimensional array into a multidimensional array. Let’s 
create an array containing the values from 1 through 20, then reshape it into four 


rows by five columns: 


In [6]: np.arange(1, 21).reshape(4, 5)
Out[6]:
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20]])


Note the chained method calls in the preceding snippet. First, arange produces an
array containing the values 1-20. Then we call reshape on that array to get the
4-by-5 array that was displayed.

You can reshape any array, provided that the new shape has the same number of
elements as the original. So a six-element one-dimensional array can become a 3-by-2
or 2-by-3 array, and vice versa, but attempting to reshape a 15-element array into a
4-by-4 array (16 elements) causes a ValueError.
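
Here is a short sketch of both the valid and the invalid cases. The variable name six is chosen for illustration, and the exact wording of the error message may vary between NumPy versions.

```python
import numpy as np

six = np.arange(1, 7)  # six elements: 1 through 6

print(six.reshape(3, 2).shape)             # 3 * 2 == 6 elements
print(six.reshape(2, 3).shape)             # 2 * 3 == 6 elements
print(six.reshape(2, 3).reshape(6).shape)  # and back to one dimension

try:
    np.arange(15).reshape(4, 4)  # 15 elements cannot fill 16 positions
except ValueError as error:
    print(f'ValueError: {error}')
```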


Displaying Large arrays 


When displaying an array, if there are 1000 items or more, NumPy drops the middle 
rows, columns or both from the output. The following snippets generate 100,000 
elements. The first case shows all four rows but only the first and last three of the 
25,000 columns. The notation . . . represents the missing data. The second case shows 


the first and last three of the 100 rows, and the first and last three of the 1000 columns: 


In [7]: np.arange(1, 100001).reshape(4, 25000)
Out[7]:
array([[     1,      2,      3, ...,  24998,  24999,  25000],
       [ 25001,  25002,  25003, ...,  49998,  49999,  50000],
       [ 50001,  50002,  50003, ...,  74998,  74999,  75000],
       [ 75001,  75002,  75003, ...,  99998,  99999, 100000]])

In [8]: np.arange(1, 100001).reshape(100, 1000)
Out[8]:
array([[     1,      2,      3, ...,    998,    999,   1000],
       [  1001,   1002,   1003, ...,   1998,   1999,   2000],
       [  2001,   2002,   2003, ...,   2998,   2999,   3000],
       ...,
       [ 97001,  97002,  97003, ...,  97998,  97999,  98000],
       [ 98001,  98002,  98003, ...,  98998,  98999,  99000],
       [ 99001,  99002,  99003, ...,  99998,  99999, 100000]])


7.6 LIST VS. ARRAY PERFORMANCE: INTRODUCING %TIMEIT

Most array operations execute significantly faster than corresponding list operations.
To demonstrate, we'll use the IPython %timeit magic command, which times the
average duration of operations. Note that the times displayed on your system may vary
from what we show here.


Timing the Creation of a List Containing Results of 6,000,000 Die Rolls 


We've demonstrated rolling a six-sided die 6,000,000 times. Here, let’s use the random
module’s randrange function with a list comprehension to create a list of six million
die rolls and time the operation using %timeit. Note that we used the
line-continuation character (\) to split the statement in snippet [2] over two lines:

In [1]: import random

In [2]: %timeit rolls_list = \
   ...:     [random.randrange(1, 7) for i in range(0, 6_000_000)]
6.29 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


By default, %timeit executes a statement in a loop, and it runs the loop seven times. If
you do not indicate the number of loops, %timeit chooses an appropriate value. In our
testing, operations that on average took more than 500 milliseconds iterated only once,
and operations that took fewer than 500 milliseconds iterated 10 times or more.

After executing the statement, %timeit displays the statement’s average execution
time, as well as the standard deviation of all the executions. On average, %timeit
indicates that it took 6.29 seconds (s) to create the list with a standard deviation of 119
milliseconds (ms). In total, the preceding snippet took about 44 seconds to run the
statement seven times.
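
Outside IPython, you can approximate this kind of report with the standard library's timeit module. The sketch below is an approximation and not %timeit's implementation. The function make_rolls and its default of 60,000 rolls (much smaller than the text's 6,000,000, to keep the run short) are assumptions, and we chose statistics.pstdev for the spread.

```python
import random
import statistics
import timeit


def make_rolls(count=60_000):
    """Create a list of die rolls with a list comprehension."""
    return [random.randrange(1, 7) for _ in range(count)]


runs, loops = 7, 1  # seven runs of one loop each, as in snippet [2]

# repeat returns one total time (in seconds) per run of `loops` calls
totals = timeit.repeat(make_rolls, repeat=runs, number=loops)
per_loop = [total / loops for total in totals]

mean = statistics.mean(per_loop)
std_dev = statistics.pstdev(per_loop)
print(f'{mean * 1000:.1f} ms +/- {std_dev * 1000:.2f} ms per loop '
      f'(mean +/- std. dev. of {runs} runs, {loops} loop each)')
```

The times will differ from run to run and from system to system, just as they do with %timeit.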


Timing the Creation of an array Containing Results of 6,000,000 Die 
Rolls 


Now, let’s use the randint function from the numpy.random module to create an
array of 6,000,000 die rolls:

In [3]: import numpy as np

In [4]: %timeit rolls_array = np.random.randint(1, 7, 6_000_000)
72.4 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


On average, %timeit indicates that it took only 72.4 milliseconds with a standard
deviation of 635 microseconds (µs) to create the array. In total, the 70 executions
(7 runs of 10 loops each) took about 5 seconds. Comparing single executions, 72.4
milliseconds is roughly 1/87th of the 6.29 seconds that snippet [2]’s statement took,
so the operation is nearly two orders of magnitude faster with array!

60,000,000 and 600,000,000 Die Rolls


Now, let’s create an array of 60,000,000 die rolls: 


In [5]: %timeit rolls_array = np.random.randint(1, 7, 60_000_000)
873 ms ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


On average, it took only 873 milliseconds to create the array. 

Finally, let’s do 600,000,000 die rolls:

In [6]: %timeit rolls_array = np.random.randint(1, 7, 600_000_000)
10.1 s ± 252 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


It took about 10 seconds to create 600,000,000 elements with NumPy vs. about 6 


seconds to create only 6,000,000 elements with a list comprehension. 


Based on these timing studies, you can see clearly why arrays are preferred over lists 
for compute-intensive operations. In the data science case studies, we'll enter the 
performance-intensive worlds of big data and AI. We'll see how clever hardware, 
software, communications and algorithm designs combine to meet the often enormous 


computing challenges of today’s applications. 


Customizing the %timeit Iterations

The number of iterations within each %timeit loop and the number of loops are
customizable with the -n and -r options. The following executes snippet [4]’s
statement three times per loop and runs the loop twice: *

* For most readers, using %timeit’s default settings should be fine.

In [7]: %timeit -n3 -r2 rolls_array = np.random.randint(1, 7, 6_000_000)
85.5 ms ± 5.32 ms per loop (mean ± std. dev. of 2 runs, 3 loops each)








Other IPython Magics 


IPython provides dozens of magics for a variety of tasks. For a complete list, see the
IPython magics documentation. * Here are a few helpful ones:

* http://ipython.readthedocs.io/en/stable/interactive/magics.html.

• %load to read code into IPython from a local file or URL.

• %save to save snippets to a file.

• %run to execute a .py file from IPython.

• %precision to change the default floating-point precision for IPython outputs.

• %cd to change directories without having to exit IPython first.

• %edit to launch an external editor, which is handy if you need to modify more
complex snippets.

• %history to view a list of all snippets and commands you've executed in the
current IPython session.


7.7 ARRAY OPERATORS 


NumPy provides many operators which enable you to write simple expressions that 
perform operations on entire arrays. Here, we demonstrate arithmetic between 


arrays and numeric values and between arrays of the same shape. 


Arithmetic Operations with arrays and Individual Numeric Values 


First, let’s perform element-wise arithmetic with arrays and numeric values by using 
arithmetic operators and augmented assignments. Element-wise operations are applied 
to every element, so snippet [4] multiplies every element by 2 and snippet [5] cubes 


every element. Each returns a new array containing the result: 


In [1]: import numpy as np

In [2]: numbers = np.arange(1, 6)

In [3]: numbers
Out[3]: array([1, 2, 3, 4, 5])

In [4]: numbers * 2
Out[4]: array([ 2,  4,  6,  8, 10])

In [5]: numbers ** 3
Out[5]: array([  1,   8,  27,  64, 125])

In [6]: numbers  # numbers is unchanged by the arithmetic operators
Out[6]: array([1, 2, 3, 4, 5])


Snippet [6] shows that the arithmetic operators did not modify numbers. Operators + 


and * are commutative, so snippet [4] could also be written as 2 * numbers. 
Augmented assignments modify every element in the left operand. 


In [7]: numbers += 10

In [8]: numbers
Out[8]: array([11, 12, 13, 14, 15])


Broadcasting 


Normally, the arithmetic operations require as operands two arrays of the same size 
and shape. When one operand is a single value, called a scalar, NumPy performs the 
element-wise calculations as if the scalar were an array of the same shape as the other 
operand, but with the scalar value in all its elements. This is called broadcasting. 
Snippets [4], [5] and [7] each use this capability. For example, snippet [4] is
equivalent to:

numbers * [2, 2, 2, 2, 2]

Broadcasting also can be applied between arrays of different sizes and shapes,
enabling some concise and powerful manipulations. We’ll show more examples of
broadcasting later in the chapter when we introduce NumPy’s universal functions.
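
As a small preview, the following sketch first confirms that multiplying by a scalar matches multiplying by an explicit array of 2s, then broadcasts a 3-by-1 column against a four-element row. The arrays column, row and table are made up for illustration.

```python
import numpy as np

numbers = np.arange(1, 6)

# scalar broadcasting gives the same result as an explicit array of 2s
explicit = numbers * np.full(numbers.shape, 2)
print(np.array_equal(numbers * 2, explicit))

# shapes (3, 1) and (4,) broadcast to a common shape of (3, 4)
column = np.array([[10], [20], [30]])
row = np.array([1, 2, 3, 4])
table = column + row
print(table.shape)
print(table)
```

Each element of table is the sum of one column value and one row value, so no explicit loops are needed to build the full grid.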


Arithmetic Operations Between arrays 


You may perform arithmetic operations and augmented assignments between arrays 
of the same shape. Let’s multiply the one-dimensional arrays numbers and 


numbers2 (created below) that each contain five elements: 


In [9]: numbers2 = np.linspace(1.1, 5.5, 5)

In [10]: numbers2
Out[10]: array([1.1, 2.2, 3.3, 4.4, 5.5])

In [11]: numbers * numbers2
Out[11]: array([12.1, 26.4, 42.9, 61.6, 82.5])

The result is a new array formed by multiplying the arrays element-wise in each
operand: 11 * 1.1, 12 * 2.2, 13 * 3.3, etc. Arithmetic between arrays of integers
and floating-point numbers results in an array of floating-point numbers.


Comparing arrays 


You can compare arrays with individual values and with other arrays. Comparisons 
are performed element-wise. Such comparisons produce arrays of Boolean values in 


which each element’s True or False value indicates the comparison result: 


In [12]: numbers
Out[12]: array([11, 12, 13, 14, 15])

In [13]: numbers >= 13
Out[13]: array([False, False,  True,  True,  True])

In [14]: numbers2
Out[14]: array([1.1, 2.2, 3.3, 4.4, 5.5])

In [15]: numbers2 < numbers
Out[15]: array([ True,  True,  True,  True,  True])

In [16]: numbers == numbers2
Out[16]: array([False, False, False, False, False])

In [17]: numbers == numbers
Out[17]: array([ True,  True,  True,  True,  True])





Snippet [13] uses broadcasting to determine whether each element of numbers is 
greater than or equal to 13. The remaining snippets compare the corresponding 


elements of each array operand. 


7.8 NUMPY CALCULATION METHODS 


An array has various methods that perform calculations using its contents. By default,
these methods ignore the array’s shape and use all the elements in the calculations.
For example, calculating the mean of an array totals all of its elements regardless of its
shape, then divides by the total number of elements. You can perform these
calculations on each dimension as well. For example, in a two-dimensional array, you
can calculate each row’s mean and each column’s mean.

Consider an array representing four students’ grades on three exams:

In [1]: import numpy as np

In [2]: grades = np.array([[87, 96, 70], [100, 87, 90],
   ...:                    [94, 77, 90], [100, 81, 82]])

In [3]: grades
Out[3]:
array([[ 87,  96,  70],
       [100,  87,  90],
       [ 94,  77,  90],
       [100,  81,  82]])


We can use methods to calculate sum, min, max, mean, std (standard deviation) and
var (variance). Each is a functional-style programming reduction:

In [4]: grades.sum()
Out[4]: 1054

In [5]: grades.min()
Out[5]: 70

In [6]: grades.max()
Out[6]: 100

In [7]: grades.mean()
Out[7]: 87.83333333333333

In [8]: grades.std()
Out[8]: 8.792357792739987

In [9]: grades.var()
Out[9]: 77.30555555555556





Calculations by Row or Column 


Many calculation methods can be performed on specific array dimensions, known as
the array’s axes. These methods receive an axis keyword argument that specifies
which dimension to use in the calculation, giving you a quick way to perform
calculations by row or column in a two-dimensional array.

Assume that you want to calculate the average grade on each exam, represented by the
columns of grades. Specifying axis=0 performs the calculation on all the row values
within each column:

In [10]: grades.mean(axis=0)
Out[10]: array([95.25, 85.25, 83.  ])


So 95.25 above is the average of the first column’s grades (87, 100, 94 and 100),
85.25 is the average of the second column’s grades (96, 87, 77 and 81) and 83 is the
average of the third column’s grades (70, 90, 90 and 82). Again, NumPy does not
display trailing 0s to the right of the decimal point in '83.'. Also note that it does
display all element values in the same field width, which is why '83.' is followed by
two spaces.

Similarly, specifying axis=1 performs the calculation on all the column values within
each individual row. To calculate each student’s average grade for all exams, we can
use:

In [11]: grades.mean(axis=1)
Out[11]: array([84.33333333, 92.33333333, 87.        , 87.66666667])


This produces four averages, one for the values in each row. So 84.33333333 is
the average of row 0’s grades (87, 96 and 70), and the other averages are for the
remaining rows.
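The axis keyword argument works the same way with the other calculation methods. As a brief sketch using the same grades data (the names exam_highs, student_lows and student_totals are illustrative only):

```python
import numpy as np

grades = np.array([[87, 96, 70], [100, 87, 90],
                   [94, 77, 90], [100, 81, 82]])

exam_highs = grades.max(axis=0)      # one result per column (exam)
student_lows = grades.min(axis=1)    # one result per row (student)
student_totals = grades.sum(axis=1)  # one result per row (student)

print(exam_highs)       # [100  96  90]
print(student_lows)     # [70 87 77 81]
print(student_totals)   # [253 277 261 263]
```

With axis=0 the result has one element per column; with axis=1 it has one element per row.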


NumPy arrays have many more calculation methods. For the complete list, see

https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html


7.9 UNIVERSAL FUNCTIONS 


NumPy offers dozens of standalone universal functions (or ufuncs) that perform
various element-wise operations. Each performs its task using one or two array or
array-like (such as lists) arguments. Some of these functions are called when you use
operators like + and * on arrays. Each returns a new array containing the results.

Let’s create an array and calculate the square root of its values, using the sqrt
universal function:

In [1]: import numpy as np

In [2]: numbers = np.array([1, 4, 9, 16, 25, 36])

In [3]: np.sqrt(numbers)
Out[3]: array([1., 2., 3., 4., 5., 6.])





Let’s add two arrays with the same shape, using the add universal function:

In [4]: numbers2 = np.arange(1, 7) * 10

In [5]: numbers2
Out[5]: array([10, 20, 30, 40, 50, 60])

In [6]: np.add(numbers, numbers2)
Out[6]: array([11, 24, 39, 56, 75, 96])

The expression np.add(numbers, numbers2) is equivalent to:

numbers + numbers2


Broadcasting with Universal Functions 


Let’s use the multiply universal function to multiply every element of numbers2
by the scalar value 5:

In [7]: np.multiply(numbers2, 5)
Out[7]: array([ 50, 100, 150, 200, 250, 300])

The expression np.multiply(numbers2, 5) is equivalent to:

numbers2 * 5


Let’s reshape numbers2 into a 2-by-3 array, then multiply its values by a
one-dimensional array of three elements:

In [8]: numbers3 = numbers2.reshape(2, 3)

In [9]: numbers3
Out[9]:
array([[10, 20, 30],
       [40, 50, 60]])

In [10]: numbers4 = np.array([2, 4, 6])

In [11]: np.multiply(numbers3, numbers4)
Out[11]:
array([[ 20,  80, 180],
       [ 80, 200, 360]])

This works because numbers4 has the same length as each row of numbers3, so
NumPy can apply the multiply operation by treating numbers4 as if it were the
following array:

array([[2, 4, 6],
       [2, 4, 6]])

If a universal function receives two arrays of different shapes that do not support
broadcasting, a ValueError occurs. You can view the broadcasting rules at:

https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html
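To see the failure case, here is a small sketch in which the second operand has two elements rather than three, so it cannot be stretched across the rows of a 2-by-3 array. The names two_values, broadcast_failed and result are hypothetical and used only for illustration:

```python
import numpy as np

numbers3 = (np.arange(1, 7) * 10).reshape(2, 3)   # shape (2, 3)
two_values = np.array([2, 4])                     # shape (2,): does not match 3 columns

broadcast_failed = False
try:
    np.multiply(numbers3, two_values)
except ValueError as error:
    broadcast_failed = True
    print('ValueError:', error)

# Reshaping the operand into a column of shape (2, 1) makes the shapes
# compatible: each row of numbers3 is multiplied by one of the two values.
result = np.multiply(numbers3, two_values.reshape(2, 1))
print(result)
```

The second call succeeds because a dimension of length 1 can be broadcast to match any length.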


Other Universal Functions 


The NumPy documentation lists universal functions in five categories: math,
trigonometry, bit manipulation, comparison and floating point. The following table lists
some functions from each category. You can view the complete list, their descriptions
and more information about universal functions at:

https://docs.scipy.org/doc/numpy/reference/ufuncs.html


NumPy universal functions

Math: add, subtract, multiply, divide, remainder, exp, log, sqrt,
power, and more.

Trigonometry: sin, cos, tan, hypot, arcsin, arccos, arctan, and more.

Bit manipulation: bitwise_and, bitwise_or, bitwise_xor, invert,
left_shift and right_shift.

Comparison: greater, greater_equal, less, less_equal, equal,
not_equal, logical_and, logical_or, logical_xor, logical_not,
minimum, maximum, and more.

Floating point: floor, ceil, isinf, isnan, fabs, trunc, and more.
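As a quick illustration, the sketch below calls one function from several of the categories in the table. The arrays a and b and the result names are made up for this example:

```python
import numpy as np

a = np.array([3, 5, 8])
b = np.array([4, 12, 15])

hypotenuses = np.hypot(a, b)               # sqrt(a**2 + b**2), element-wise
larger = np.maximum(a, b)                  # element-wise maximum
both_big = np.logical_and(a > 4, b > 4)    # element-wise Boolean AND
remainders = np.remainder(b, a)            # element-wise b % a

print(hypotenuses)   # [ 5. 13. 17.]
print(larger)        # [ 4 12 15]
print(both_big)      # [False  True  True]
print(remainders)    # [1 2 7]
```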


7.10 INDEXING AND SLICING 


One-dimensional arrays can be indexed and sliced using the same syntax and
techniques we demonstrated in the “Sequences: Lists and Tuples” chapter. Here, we
focus on array-specific indexing and slicing capabilities.


Indexing with Two-Dimensional arrays 


To select an element in a two-dimensional array, specify a tuple containing the
element’s row and column indices in square brackets (as in snippet [4]):

In [1]: import numpy as np

In [2]: grades = np.array([[87, 96, 70], [100, 87, 90],
   ...:                    [94, 77, 90], [100, 81, 82]])

In [3]: grades
Out[3]:
array([[ 87,  96,  70],
       [100,  87,  90],
       [ 94,  77,  90],
       [100,  81,  82]])

In [4]: grades[0, 1]  # row 0, column 1
Out[4]: 96


Selecting a Subset of a Two-Dimensional array’s Rows 


To select a single row, specify only one index in square brackets: 


In [5]: grades[1]
Out[5]: array([100,  87,  90])

To select multiple sequential rows, use slice notation:

In [6]: grades[0:2]
Out[6]:
array([[ 87,  96,  70],
       [100,  87,  90]])

To select multiple non-sequential rows, use a list of row indices:

In [7]: grades[[1, 3]]
Out[7]:
array([[100,  87,  90],
       [100,  81,  82]])


Selecting a Subset of a Two-Dimensional array’s Columns 


You can select subsets of the columns by providing a tuple specifying the row(s) and
column(s) to select. Each can be a specific index, a slice or a list. Let’s select only the
elements in the first column:

In [8]: grades[:, 0]
Out[8]: array([ 87, 100,  94, 100])

The 0 after the comma indicates that we’re selecting only column 0. The : before the
comma indicates which rows within that column to select. In this case, : is a slice
representing all rows. This also could be a specific row number, a slice representing a
subset of the rows or a list of specific row indices to select, as in snippets [5]-[7].

You can select consecutive columns using a slice:

In [9]: grades[:, 1:3]
Out[9]:
array([[96, 70],
       [87, 90],
       [77, 90],
       [81, 82]])

or specific columns using a list of column indices:

In [10]: grades[:, [0, 2]]
Out[10]:
array([[ 87,  70],
       [100,  90],
       [ 94,  90],
       [100,  82]])


7.11 VIEWS: SHALLOW COPIES 


The previous chapter introduced view objects, that is, objects that “see” the data in
other objects, rather than having their own copies of the data. Views are shallow copies.
Various array methods and slicing operations produce views of an array’s data.

The array method view returns a new array object with a view of the original array
object’s data. First, let’s create an array and a view of that array:

In [1]: import numpy as np

In [2]: numbers = np.arange(1, 6)

In [3]: numbers
Out[3]: array([1, 2, 3, 4, 5])

In [4]: numbers2 = numbers.view()

In [5]: numbers2
Out[5]: array([1, 2, 3, 4, 5])





We can use the built-in id function to see that numbers and numbers2 are different
objects:

In [6]: id(numbers)
Out[6]: 4462958592

In [7]: id(numbers2)
Out[7]: 4590846240


To prove that numbers2 views the same data as numbers, let’s modify an element in
numbers, then display both arrays:

In [8]: numbers[1] *= 10

In [9]: numbers2
Out[9]: array([ 1, 20,  3,  4,  5])

In [10]: numbers
Out[10]: array([ 1, 20,  3,  4,  5])

Similarly, changing a value in the view also changes that value in the original array:

In [11]: numbers2[1] /= 10

In [12]: numbers
Out[12]: array([1, 2, 3, 4, 5])

In [13]: numbers2
Out[13]: array([1, 2, 3, 4, 5])


Slice Views 


Slices also create views. Let’s make numbers2 a slice that views only the first three
elements of numbers:

In [14]: numbers2 = numbers[0:3]

In [15]: numbers2
Out[15]: array([1, 2, 3])

Again, we can confirm that numbers and numbers2 are different objects with id:

In [16]: id(numbers)
Out[16]: 4462958592

In [17]: id(numbers2)
Out[17]: 4590848000


We can confirm that numbers2 is a view of only the first three numbers elements by
attempting to access numbers2[3], which produces an IndexError:

In [18]: numbers2[3]
-------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-18-582053f52daa> in <module>()
----> 1 numbers2[3]

IndexError: index 3 is out of bounds for axis 0 with size 3














Now, let’s modify an element both arrays share, then display them. Again, we see that
numbers2 is a view of numbers:

In [19]: numbers[1] *= 20

In [20]: numbers
Out[20]: array([ 1, 40,  3,  4,  5])

In [21]: numbers2
Out[21]: array([ 1, 40,  3])
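Besides modifying an element and watching the change appear in both arrays, you can ask NumPy directly whether two arrays share data. The sketch below uses the array attribute base and the function np.shares_memory, neither of which is used elsewhere in this section; the names whole_view and slice_view are illustrative:

```python
import numpy as np

numbers = np.arange(1, 6)
whole_view = numbers.view()   # view of all the elements
slice_view = numbers[0:3]     # view of the first three elements

# A view's base attribute refers to the array whose data it uses
print(whole_view.base is numbers)                # True
print(slice_view.base is numbers)                # True

# shares_memory reports whether two arrays overlap in memory
print(np.shares_memory(numbers, slice_view))     # True
```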


7.12 DEEP COPIES 


Though views are separate array objects, they save memory by sharing element data
from other arrays. However, when sharing mutable values, sometimes it’s necessary
to create a deep copy with independent copies of the original data. This is especially
important in multi-core programming, where separate parts of your program could
attempt to modify your data at the same time, possibly corrupting it.


The array method copy returns a new array object with a deep copy of the original
array object’s data. First, let’s create an array and a deep copy of that array:

In [1]: import numpy as np

In [2]: numbers = np.arange(1, 6)

In [3]: numbers
Out[3]: array([1, 2, 3, 4, 5])

In [4]: numbers2 = numbers.copy()

In [5]: numbers2
Out[5]: array([1, 2, 3, 4, 5])





To prove that numbers2 has a separate copy of the data in numbers, let’s modify an
element in numbers, then display both arrays:

In [6]: numbers[1] *= 10

In [7]: numbers
Out[7]: array([ 1, 20,  3,  4,  5])

In [8]: numbers2
Out[8]: array([1, 2, 3, 4, 5])

As you can see, the change appears only in numbers.


Module copy—Shallow vs. Deep Copies for Other Types of Python 
Objects 


In previous chapters, we covered shallow copying. In this chapter, we’ve covered how
to deep copy array objects using their copy method. If you need deep copies of other
types of Python objects, pass them to the copy module’s deepcopy function.
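As a brief sketch of the difference for an ordinary nested list (the names original, shallow and deep are illustrative), the copy module’s copy function duplicates only the outer list, whereas deepcopy also duplicates the nested lists:

```python
import copy

original = [[1, 2, 3], [4, 5, 6]]

shallow = copy.copy(original)     # new outer list, same inner lists
deep = copy.deepcopy(original)    # new outer list and new inner lists

original[0][0] = 100              # modify a nested element

print(shallow[0][0])   # 100: the shallow copy shares the inner list
print(deep[0][0])      # 1: the deep copy is unaffected
```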


7.13 RESHAPING AND TRANSPOSING 


We’ve used array method reshape to produce two-dimensional arrays from
one-dimensional ranges. NumPy provides various other ways to reshape arrays.


reshape vs. resize

The array methods reshape and resize both enable you to change an array’s
dimensions. Method reshape returns a view (shallow copy) of the original array with
the new dimensions. It does not modify the original array:

In [1]: import numpy as np

In [2]: grades = np.array([[87, 96, 70], [100, 87, 90]])

In [3]: grades
Out[3]:
array([[ 87,  96,  70],
       [100,  87,  90]])

In [4]: grades.reshape(1, 6)
Out[4]: array([[ 87,  96,  70, 100,  87,  90]])

In [5]: grades
Out[5]:
array([[ 87,  96,  70],
       [100,  87,  90]])


Method resize modifies the original array’s shape: 


In [6]: grades.resize(1, 6)

In [7]: grades
Out[7]: array([[ 87,  96,  70, 100,  87,  90]])


flatten vs. ravel

You can take a multidimensional array and flatten it into a single dimension with the
methods flatten and ravel. Method flatten deep copies the original array’s data:

In [8]: grades = np.array([[87, 96, 70], [100, 87, 90]])

In [9]: grades
Out[9]:
array([[ 87,  96,  70],
       [100,  87,  90]])

In [10]: flattened = grades.flatten()

In [11]: flattened
Out[11]: array([ 87,  96,  70, 100,  87,  90])

In [12]: grades
Out[12]:
array([[ 87,  96,  70],
       [100,  87,  90]])





To confirm that grades and flattened do not share the data, let’s modify an element
of flattened, then display both arrays:

In [13]: flattened[0] = 100

In [14]: flattened
Out[14]: array([100,  96,  70, 100,  87,  90])

In [15]: grades
Out[15]:
array([[ 87,  96,  70],
       [100,  87,  90]])


Method ravel produces a view of the original array, which shares the grades
array’s data:

In [16]: raveled = grades.ravel()

In [17]: raveled
Out[17]: array([ 87,  96,  70, 100,  87,  90])

In [18]: grades
Out[18]:
array([[ 87,  96,  70],
       [100,  87,  90]])


To confirm that grades and raveled share the same data, let’s modify an element of
raveled, then display both arrays:

In [19]: raveled[0] = 100

In [20]: raveled
Out[20]: array([100,  96,  70, 100,  87,  90])

In [21]: grades
Out[21]:
array([[100,  96,  70],
       [100,  87,  90]])





Transposing Rows and Columns 


You can quickly transpose an array’s rows and columns, that is, “flip” the array so
the rows become the columns and the columns become the rows. The T attribute
returns a transposed view (shallow copy) of the array. The original grades array
represents two students’ grades (the rows) on three exams (the columns). Let’s
transpose the rows and columns to view the data as the grades on three exams (the
rows) for two students (the columns):

In [22]: grades.T
Out[22]:
array([[100, 100],
       [ 96,  87],
       [ 70,  90]])

Transposing does not modify the original array:

In [23]: grades
Out[23]:
array([[100,  96,  70],
       [100,  87,  90]])


Horizontal and Vertical Stacking 


You can combine arrays by adding more columns or more rows, known as horizontal
stacking and vertical stacking. Let’s create another 2-by-3 array of grades:

In [24]: grades2 = np.array([[94, 77, 90], [100, 81, 82]])


Let’s assume grades2 represents three additional exam grades for the two students in
the grades array. We can combine grades and grades2 with NumPy’s hstack
(horizontal stack) function by passing a tuple containing the arrays to combine.
The extra parentheses are required because hstack expects one argument:

In [25]: np.hstack((grades, grades2))
Out[25]:
array([[100,  96,  70,  94,  77,  90],
       [100,  87,  90, 100,  81,  82]])


Next, let’s assume that grades2 represents two more students’ grades on three exams.
In this case, we can combine grades and grades2 with NumPy’s vstack (vertical
stack) function:

In [26]: np.vstack((grades, grades2))
Out[26]:
array([[100,  96,  70],
       [100,  87,  90],
       [ 94,  77,  90],
       [100,  81,  82]])
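One way to keep the two functions straight is to look at the resulting shapes. The sketch below also compares the results with NumPy’s concatenate function, which this section does not cover; for two-dimensional arrays, stacking horizontally corresponds to concatenating along axis 1 and stacking vertically to concatenating along axis 0. The names side_by_side and stacked are illustrative:

```python
import numpy as np

grades = np.array([[100, 96, 70], [100, 87, 90]])
grades2 = np.array([[94, 77, 90], [100, 81, 82]])

side_by_side = np.hstack((grades, grades2))   # more columns
stacked = np.vstack((grades, grades2))        # more rows

print(side_by_side.shape)   # (2, 6)
print(stacked.shape)        # (4, 3)

# Same results via concatenate along the column axis and the row axis
same_h = np.array_equal(side_by_side, np.concatenate((grades, grades2), axis=1))
same_v = np.array_equal(stacked, np.concatenate((grades, grades2), axis=0))
print(same_h, same_v)       # True True
```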


7.14 INTRO TO DATA SCIENCE: PANDAS SERIES AND DATAFRAMES


NumPy’s array is optimized for homogeneous numeric data that’s accessed via integer
indices. Data science presents unique demands for which more customized data
structures are required. Big data applications must support mixed data types,
customized indexing, missing data, data that’s not structured consistently and data that
needs to be manipulated into forms appropriate for the databases and data analysis
packages you use.


Pandas is the most popular library for dealing with such data. It provides two key
collections that you’ll use in several of our Intro to Data Science sections and
throughout the data science case studies: Series for one-dimensional collections and
DataFrames for two-dimensional collections. You can use pandas’ MultiIndex to
manipulate multi-dimensional data in the context of Series and DataFrames.


Wes McKinney created pandas in 2008 while working in industry. The name pandas is
derived from the term “panel data,” which is data for measurements over time, such as
stock prices or historical temperature readings. McKinney needed a library in which the
same data structures could handle both time- and non-time-based data with support
for data alignment, missing data, common database-style data manipulations, and
more.⁴

⁴ McKinney, Wes. Python for Data Analysis: Data Wrangling with Pandas, NumPy,
and IPython, pp. 123-165. Sebastopol, CA: O’Reilly Media, 2018.


NumPy and pandas are intimately related. Series and DataFrames use arrays
“under the hood.” Series and DataFrames are valid arguments to many NumPy
operations. Similarly, arrays are valid arguments to many Series and DataFrame
operations.


Pandas is a massive topic: the PDF of its documentation⁵ is over 2000 pages. In this
and the next chapters’ Intro to Data Science sections, we present an introduction to
pandas. We discuss its Series and DataFrames collections, and use them in support
of data preparation. You’ll see that Series and DataFrames make it easy for you to
perform common tasks like selecting elements a variety of ways, filter/map/reduce
operations (central to functional-style programming and big data), mathematical
operations, visualization and more.

⁵ For the latest pandas documentation, see
http://pandas.pydata.org/pandas-docs/stable/.


7.14.1 pandas Series 


A Series is an enhanced one-dimensional array. Whereas arrays use only zero-based
integer indices, Series support custom indexing, including even non-integer
indices like strings. Series also offer additional capabilities that make them more
convenient for many data-science oriented tasks. For example, Series may have
missing data, and many Series operations ignore missing data by default.
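As a small preview of the missing-data behavior (Series creation is covered in the subsections that follow), the sketch below uses NumPy’s nan value to mark a missing reading. The temperatures data is made up for illustration, and skipna is the keyword argument that controls whether missing values are ignored:

```python
import numpy as np
import pandas as pd

temperatures = pd.Series([98.6, np.nan, 99.2])   # np.nan marks a missing value

present = temperatures.count()                   # counts only non-missing values
mean_default = temperatures.mean()               # ignores the missing value
mean_strict = temperatures.mean(skipna=False)    # nan: missing value not ignored

print(present)        # 2
print(mean_default)   # the mean of 98.6 and 99.2
print(mean_strict)    # nan
```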


Creating a Series with Default Indices 


By default, a Series has integer indices numbered sequentially from 0. The following
creates a Series of student grades from a list of integers:

In [1]: import pandas as pd

In [2]: grades = pd.Series([87, 100, 94])

The initializer also may be a tuple, a dictionary, an array, another Series or a single
value. We’ll show a single value momentarily.


Displaying a Series 


Pandas displays a Series in two-column format with the indices left aligned in the left
column and the values right aligned in the right column. After listing the Series
elements, pandas shows the data type (dtype) of the underlying array’s elements:

In [3]: grades
Out[3]:
0     87
1    100
2     94
dtype: int64

Note how easy it is to display a Series in this format, compared to the corresponding
code for displaying a list in the same two-column format.


Creating a Series with All Elements Having the Same Value 


You can create a Series of elements that all have the same value: 


In [4]: pd.Series(98.6, range(3))
Out[4]:
0    98.6
1    98.6
2    98.6
dtype: float64

The second argument is a one-dimensional iterable object (such as a list, an array or a
range) containing the Series’ indices. The number of indices determines the number
of elements.


Accessing a Series’ Elements

You can access a Series’ elements via square brackets containing an index:

In [5]: grades[0]
Out[5]: 87


Producing Descriptive Statistics for a Series 


Series provides many methods for common tasks including producing various
descriptive statistics. Here we show count, mean, min, max and std (standard
deviation):

In [6]: grades.count()
Out[6]: 3

In [7]: grades.mean()
Out[7]: 93.66666666666667

In [8]: grades.min()
Out[8]: 87

In [9]: grades.max()
Out[9]: 100

In [10]: grades.std()
Out[10]: 6.506407098647712


Each of these is a functional-style reduction. Calling Series method describe
produces all these stats and more:

In [11]: grades.describe()
Out[11]:
count      3.000000
mean      93.666667
std        6.506407
min       87.000000
25%       90.500000
50%       94.000000
75%       97.000000
max      100.000000
dtype: float64

The 25%, 50% and 75% are quartiles:

• 50% represents the median of the sorted values.

• 25% represents the median of the first half of the sorted values.

• 75% represents the median of the second half of the sorted values.


For the quartiles, if there are two middle elements, then their average is that quartile’s
median. We have only three values in our Series, so the 25% quartile is the average of
87 and 94, and the 75% quartile is the average of 94 and 100. The interquartile range
is the 75% quartile minus the 25% quartile, which is another measure of dispersion,
like standard deviation and variance. Of course, quartiles and
interquartile range are more useful in larger datasets.
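If you need the quartiles individually rather than as part of describe’s output, one option is the Series method quantile, which this section does not otherwise use. The following sketch computes the interquartile range for the three grades; the names q1, q3 and iqr are illustrative:

```python
import pandas as pd

grades = pd.Series([87, 100, 94])

q1 = grades.quantile(0.25)   # 25% quartile: average of 87 and 94
q3 = grades.quantile(0.75)   # 75% quartile: average of 94 and 100
iqr = q3 - q1                # interquartile range

print(q1)    # 90.5
print(q3)    # 97.0
print(iqr)   # 6.5
```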


Creating a Series with Custom Indices 


You can specify custom indices with the index keyword argument: 


In [12]: grades = pd.Series([87, 100, 94], index=['Wally', 'Eva', 'Sam'])

In [13]: grades
Out[13]:
Wally     87
Eva      100
Sam       94
dtype: int64











In this case, we used string indices, but you can use other immutable types, including
integers not beginning at 0 and nonconsecutive integers. Again, notice how nicely and
concisely pandas formats a Series for display.


Dictionary Initializers 


If you initialize a Series with a dictionary, its keys become the Series’ indices, and
its values become the Series’ element values:

In [14]: grades = pd.Series({'Wally': 87, 'Eva': 100, 'Sam': 94})

In [15]: grades
Out[15]:
Wally     87
Eva      100
Sam       94
dtype: int64


Accessing Elements of a Series Via Custom Indices 


In a Series with custom indices, you can access individual elements via square
brackets containing a custom index value:

In [16]: grades['Eva']
Out[16]: 100

If the custom indices are strings that could represent valid Python identifiers, pandas
automatically adds them to the Series as attributes that you can access via a dot (.),
as in:

In [17]: grades.Wally
Out[17]: 87


Series also has built-in attributes. For example, the dtype attribute returns the
underlying array’s element type:

In [18]: grades.dtype
Out[18]: dtype('int64')

and the values attribute returns the underlying array:

In [19]: grades.values
Out[19]: array([ 87, 100,  94])


Creating a Series of Strings 


If a Series contains strings, you can use its str attribute to call string methods on
the elements. First, let’s create a Series of hardware-related strings:

In [20]: hardware = pd.Series(['Hammer', 'Saw', 'Wrench'])

In [21]: hardware
Out[21]:
0    Hammer
1       Saw
2    Wrench
dtype: object

Note that pandas also right-aligns string element values and that the dtype for strings
is object.


Let’s call string method contains on each element to determine whether the value of
each element contains a lowercase 'a':

In [22]: hardware.str.contains('a')
Out[22]:
0     True
1     True
2    False
dtype: bool

Pandas returns a Series containing bool values indicating the contains method’s
result for each element. The element at index 2 ('Wrench') does not contain an 'a',
so its element in the resulting Series is False. Note that pandas handles the iteration
internally for you, which is another example of functional-style programming. The str
attribute provides many string-processing methods that are similar to those in Python’s
string type. For a list, see:
https://pandas.pydata.org/pandas-docs/stable/api.html#string-handling.


The following uses string method upper to produce a new Series containing the
uppercase versions of each element in hardware:

In [23]: hardware.str.upper()
Out[23]:
0    HAMMER
1       SAW
2    WRENCH
dtype: object


7.14.2 DataFrames 


A DataFrame is an enhanced two-dimensional array. Like Series, DataFrames can
have custom row and column indices, and offer additional operations and capabilities
that make them more convenient for many data-science oriented tasks. DataFrames
also support missing data. Each column in a DataFrame is a Series. The Series
representing each column may contain different element types, as you’ll soon see when
we discuss loading datasets into DataFrames.


Creating a DataFrame from a Dictionary 


Let’s create a DataFrame from a dictionary that represents student grades on three
exams:

In [1]: import pandas as pd

In [2]: grades_dict = {'Wally': [87, 96, 70], 'Eva': [100, 87, 90],
   ...:                'Sam': [94, 77, 90], 'Katie': [100, 81, 82],
   ...:                'Bob': [83, 65, 85]}

In [3]: grades = pd.DataFrame(grades_dict)

In [4]: grades
Out[4]:
   Wally  Eva  Sam  Katie  Bob
0     87  100   94    100   83
1     96   87   77     81   65
2     70   90   90     82   85

Pandas displays DataFrames in tabular format with the indices left aligned in the
index column and the remaining columns’ values right aligned. The dictionary’s keys
become the column names and the values associated with each key become the element
values in the corresponding column. Shortly, we’ll show how to “flip” the rows and
columns. By default, the row indices are auto-generated integers starting from 0.


Customizing a DataFrame’s Indices with the index Attribute

We could have specified custom indices with the index keyword argument when we
created the DataFrame, as in:

pd.DataFrame(grades_dict, index=['Test1', 'Test2', 'Test3'])

Let’s use the index attribute to change the DataFrame’s indices from sequential
integers to labels:

In [5]: grades.index = ['Test1', 'Test2', 'Test3']

In [6]: grades
Out[6]:
       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96   87   77     81   65
Test3     70   90   90     82   85

When specifying the indices, you must provide a one-dimensional collection that has
the same number of elements as there are rows in the DataFrame; otherwise, a
ValueError occurs. Series also provides an index attribute for changing an
existing Series’ indices.
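The following sketch shows the error case on a smaller, two-column DataFrame. The variable mismatch_detected is a hypothetical flag added only so the outcome is easy to inspect:

```python
import pandas as pd

grades = pd.DataFrame({'Wally': [87, 96, 70], 'Eva': [100, 87, 90]})

mismatch_detected = False
try:
    grades.index = ['Test1', 'Test2']   # two labels for three rows
except ValueError as error:
    mismatch_detected = True
    print('ValueError:', error)

grades.index = ['Test1', 'Test2', 'Test3']   # one label per row
print(list(grades.index))   # ['Test1', 'Test2', 'Test3']
```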


Accessing a DataFrame’s Columns 


One benefit of pandas is that you can quickly and conveniently look at your data in
many different ways, including selecting portions of the data. Let’s start by getting
Eva’s grades by name, which displays her column as a Series:

In [7]: grades['Eva']
Out[7]:
Test1    100
Test2     87
Test3     90
Name: Eva, dtype: int64

If a DataFrame’s column-name strings are valid Python identifiers, you can use them
as attributes. Let’s get Sam’s grades with the Sam attribute:

In [8]: grades.Sam
Out[8]:
Test1    94
Test2    77
Test3    90
Name: Sam, dtype: int64


Selecting Rows via the loc and iloc Attributes

Though DataFrames support indexing capabilities with [], the pandas documentation
recommends using the attributes loc, iloc, at and iat, which are optimized to
access DataFrames and also provide additional capabilities beyond what you can do
only with []. Also, the documentation states that indexing with [] often produces a
copy of the data, which is a logic error if you attempt to assign new values to the
DataFrame by assigning to the result of the [] operation.
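As a minimal sketch of the recommended style, the following assigns through a single loc operation that names both the rows and the column, so the assignment targets the DataFrame itself rather than an intermediate result. The replacement grade values here are made up for illustration:

```python
import pandas as pd

grades = pd.DataFrame({'Wally': [87, 96, 70], 'Eva': [100, 87, 90]},
                      index=['Test1', 'Test2', 'Test3'])

# One loc operation with a row label and a column label
grades.loc['Test2', 'Eva'] = 95

# The row selector also can be a Boolean condition:
# raise any of Wally's grades that are below 80 to 80
grades.loc[grades['Wally'] < 80, 'Wally'] = 80

print(grades)
```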


You can access a row by its label via the DataFrame’s loc attribute. The following
lists all the grades in the row 'Test1':

In [9]: grades.loc['Test1']
Out[9]:
Wally     87
Eva      100
Sam       94
Katie    100
Bob       83
Name: Test1, dtype: int64

You also can access rows by integer zero-based indices using the iloc attribute (the i
in iloc means that it’s used with integer indices). The following lists all the grades in
the second row:

In [10]: grades.iloc[1]
Out[10]:
Wally    96
Eva      87
Sam      77
Katie    81
Bob      65
Name: Test2, dtype: int64


Selecting Rows via Slices and Lists with the loc and iloc Attributes

The index can be a slice. When using slices containing labels with loc, the range
specified includes the high index ('Test3'):

In [11]: grades.loc['Test1':'Test3']
Out[11]:
       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96   87   77     81   65
Test3     70   90   90     82   85

When using slices containing integer indices with iloc, the range you specify excludes
the high index (2):

In [12]: grades.iloc[0:2]
Out[12]:
       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96   87   77     81   65

To select specific rows, use a list rather than slice notation with loc or iloc:

In [13]: grades.loc[['Test1', 'Test3']]
Out[13]:
       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test3     70   90   90     82   85

In [14]: grades.iloc[[0, 2]]
Out[14]:
       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test3     70   90   90     82   85





Selecting Subsets of the Rows and Columns 


So far, we’ve selected only entire rows. You can focus on small subsets of a DataFrame
by selecting rows and columns using two slices, two lists or a combination of slices and
lists.

Suppose you want to view only Eva’s and Katie’s grades on Test1 and Test2. We
can do that by using loc with a slice for the two consecutive rows and a list for the two
non-consecutive columns:

In [15]: grades.loc['Test1':'Test2', ['Eva', 'Katie']]
Out[15]:
       Eva  Katie
Test1  100    100
Test2   87     81


The slice 'Test1':'Test2' selects the rows for Test1 and Test2. The list ['Eva',
'Katie'] selects only the corresponding grades from those two columns.

Let’s use iloc with a list and a slice to select the first and third tests and the first three
columns for those tests:

In [16]: grades.iloc[[0, 2], 0:3]
Out[16]:
       Wally  Eva  Sam
Test1     87  100   94
Test3     70   90   90


Boolean Indexing 


One of pandas’ more powerful selection capabilities is Boolean indexing. For
example, let’s select all the A grades, that is, those that are greater than or equal to 90:

In [17]: grades[grades >= 90]
Out[17]:
       Wally    Eva   Sam  Katie  Bob
Test1    NaN  100.0  94.0  100.0  NaN
Test2   96.0    NaN   NaN    NaN  NaN
Test3    NaN   90.0  90.0    NaN  NaN


Pandas checks every grade to determine whether its value is greater than or equal to 90 
and, if so, includes it in the new DataFrame. Grades for which the condition is False 
are represented as NaN (not a number) in the new DataFrame. NaN is pandas’ 


notation for missing values. 
Let’s select all the B grades in the range 80-89: 


In [18]: grades[(grades >= 80) & (grades < 90)]
Out[18]:
       Wally   Eva  Sam  Katie   Bob
Test1   87.0   NaN  NaN    NaN  83.0
Test2    NaN  87.0  NaN   81.0   NaN
Test3    NaN   NaN  NaN   82.0  85.0

Pandas Boolean indices combine multiple conditions with the Python operator & (bitwise AND), not the and Boolean operator. For or conditions, use | (bitwise OR). NumPy also supports Boolean indexing for arrays, but always returns a one-dimensional array containing only the values that satisfy the condition.
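The following is a minimal sketch of combining conditions with & and of the difference between pandas and NumPy Boolean indexing. It assumes the grades DataFrame can be rebuilt from the values shown in this section's outputs; the names b_mask, b_grades, b_count and a_values are illustrative only.

```python
import pandas as pd

# Grades rebuilt from the outputs shown in this section (assumed construction).
grades = pd.DataFrame(
    {'Wally': [87, 96, 70], 'Eva': [100, 87, 90], 'Sam': [94, 77, 90],
     'Katie': [100, 81, 82], 'Bob': [83, 65, 85]},
    index=['Test1', 'Test2', 'Test3'])

# Each comparison yields a Boolean DataFrame; & combines them element by element.
# The parentheses are required because & binds more tightly than >= and <.
b_mask = (grades >= 80) & (grades < 90)
b_grades = grades[b_mask]          # same shape as grades; non-B cells become NaN
b_count = int(b_mask.sum().sum())  # total number of B grades

# The same kind of mask applied to the underlying NumPy array returns a flat array.
values = grades.to_numpy()
a_values = values[values >= 90]

print(b_grades)
print(b_count, a_values.ndim, a_values.tolist())
```

Writing the condition with the keyword and instead of & would raise a ValueError, because a whole DataFrame has no single truth value.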


Accessing a Specific DataFrame Cell by Row and Column

You can use a DataFrame’s at and iat attributes to get a single value from a DataFrame. Like loc and iloc, at uses labels and iat uses integer indices. In each case, the row and column indices must be separated by a comma. Let’s select Eva’s Test2 grade (87) and Wally’s Test3 grade (70):

In [19]: grades.at['Test2', 'Eva']
Out[19]: 87

In [20]: grades.iat[2, 0]
Out[20]: 70


You also can assign new values to specific elements. Let’s change Eva’s Test2 grade to 


100 using at, then change it back to 87 using iat: 


In [21]: grades.at['Test2', 'Eva'] = 100

In [22]: grades.at['Test2', 'Eva']
Out[22]: 100

In [23]: grades.iat[1, 1] = 87

In [24]: grades.iat[1, 1]
Out[24]: 87





Descriptive Statistics 


Both Series and DataFrames have a describe method that calculates basic descriptive statistics for the data and returns them as a DataFrame. In a DataFrame, the statistics are calculated by column (again, soon you'll see how to flip rows and columns):

In [25]: grades.describe()
Out[25]:
           Wally         Eva        Sam       Katie        Bob
count   3.000000    3.000000   3.000000    3.000000   3.000000
mean   84.333333   92.333333  87.000000   87.666667  77.666667
std    13.203535    6.806859   8.888194   10.692677  11.015141
min    70.000000   87.000000  77.000000   81.000000  65.000000
25%    78.500000   88.500000  83.500000   81.500000  74.000000
50%    87.000000   90.000000  90.000000   82.000000  83.000000
75%    91.500000   95.000000  92.000000   91.000000  84.000000
max    96.000000  100.000000  94.000000  100.000000  85.000000


As you can see, describe gives you a quick way to summarize your data. It nicely demonstrates the power of array-oriented programming with a clean, concise functional-style call. Pandas handles internally all the details of calculating these statistics for each column. You might be interested in seeing similar statistics on a test-by-test basis so you can see how all the students perform on Tests 1, 2 and 3; we'll show how to do that shortly.


By default, pandas calculates the descriptive statistics with floating-point values and 


displays them with six digits of precision. You can control the precision and other 


default settings with pandas’ set_option function: 


In [26]: pd.set_option('display.precision', 2)

In [27]: grades.describe()
Out[27]:
       Wally     Eva    Sam   Katie    Bob
count   3.00    3.00   3.00    3.00   3.00
mean   84.33   92.33  87.00   87.67  77.67
std    13.20    6.81   8.89   10.69  11.02
min    70.00   87.00  77.00   81.00  65.00
25%    78.50   88.50  83.50   81.50  74.00
50%    87.00   90.00  90.00   82.00  83.00
75%    91.50   95.00  92.00   91.00  84.00
max    96.00  100.00  94.00  100.00  85.00


For student grades, the most important of these statistics is probably the mean. You can 


calculate that for each student simply by calling mean on the DataFrame: 


In [28]: grades.mean()
Out[28]:
Wally    84.33
Eva      92.33
Sam      87.00
Katie    87.67
Bob      77.67
dtype: float64


In a moment, we'll show how to get the average of all the students’ grades on each test 


in one line of additional code. 


Transposing the DataFrame with the T Attribute 


You can quickly transpose the rows and columns—so the rows become the columns, 


and the columns become the rows—by using the T attribute: 


In [29]: grades.T
Out[29]:
       Test1  Test2  Test3
Wally     87     96     70
Eva      100     87     90
Sam       94     77     90
Katie    100     81     82
Bob       83     65     85





T returns a transposed view (not a copy) of the DataFrame. 


Let’s assume that rather than getting the summary statistics by student, you want to get them by test. Simply call describe on grades.T, as in:

In [30]: grades.T.describe()
Out[30]:
        Test1  Test2  Test3
count    5.00   5.00   5.00
mean    92.80  81.20  83.40
std      7.66  11.54   8.23
min     83.00  65.00  70.00
25%     87.00  77.00  82.00
50%     94.00  81.00  85.00
75%    100.00  87.00  90.00
max    100.00  96.00  90.00


To see the average of all the students’ grades on each test, just call mean on the T 


attribute: 


In [31]: grades.T.mean()
Out[31]:
Test1    92.8
Test2    81.2
Test3    83.4
dtype: float64
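As a hedged aside, the per-test averages can also be computed without transposing by passing axis=1 to mean, which averages across the columns of each row. The sketch below assumes the grades DataFrame can be rebuilt from the values shown in this section; per_test and via_transpose are illustrative names.

```python
import pandas as pd

# Grades rebuilt from the outputs shown in this section (assumed construction).
grades = pd.DataFrame(
    {'Wally': [87, 96, 70], 'Eva': [100, 87, 90], 'Sam': [94, 77, 90],
     'Katie': [100, 81, 82], 'Bob': [83, 65, 85]},
    index=['Test1', 'Test2', 'Test3'])

per_test = grades.mean(axis=1)   # average across the students in each row (test)
via_transpose = grades.T.mean()  # the approach used in the text

print(per_test)
```

Both expressions produce a Series indexed by test name; which one reads better is largely a matter of taste.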


Sorting Rows by Their Indices

You'll often sort data for easier readability. You can sort a DataFrame by its rows or columns, based on their indices or values. Let’s sort the rows by their indices in descending order using sort_index and its keyword argument ascending=False (the default is to sort in ascending order). This returns a new DataFrame containing the sorted data:

In [32]: grades.sort_index(ascending=False)
Out[32]:
       Wally  Eva  Sam  Katie  Bob
Test3     70   90   90     82   85
Test2     96   87   77     81   65
Test1     87  100   94    100   83


Sorting by Column Indices 


Now let’s sort the columns into ascending order (left-to-right) by their column names. 
Passing the axis=1 keyword argument indicates that we wish to sort the column 


indices, rather than the row indices—axis=0 (the default) sorts the row indices: 


In [33]: grades.sort_index(axis=1)
Out[33]:
       Bob  Eva  Katie  Sam  Wally
Test1   83  100    100   94     87
Test2   65   87     81   77     96
Test3   85   90     82   90     70


Sorting by Column Values 


Let’s assume we want to see Test 1’s grades in descending order so we can see the 
students’ names in highest-to-lowest grade order. We can call the method 


sort_values as follows: 


In [34]: grades.sort_values(by='Test1', axis=1, ascending=False)
Out[34]:
       Eva  Katie  Sam  Wally  Bob
Test1  100    100   94     87   83
Test2   87     81   77     96   65
Test3   90     82   90     70   85


The by and axis keyword arguments work together to determine which values will be 


sorted. In this case, we sort based on the column values (axis=1) for Test1. 


Of course, it might be easier to read the grades and names if they were in a column, so we can sort the transposed DataFrame instead. Here, we did not need to specify the axis keyword argument, because sort_values sorts data in a specified column by default:

In [35]: grades.T.sort_values(by='Test1', ascending=False)
Out[35]:
       Test1  Test2  Test3
Eva      100     87     90
Katie    100     81     82
Sam       94     77     90
Wally     87     96     70
Bob       83     65     85





Finally, since you're sorting only Test1’s grades, you might not want to see the other 


tests at all. So, let’s combine selection with sorting: 


In [36]: grades.loc['Test1'].sort_values(ascending=False)
Out[36]:
Katie    100
Eva      100
Sam       94
Wally     87
Bob       83
Name: Test1, dtype: int64


Copy vs. In-Place Sorting 


By default, sort_index and sort_values return a copy of the original DataFrame, which could require substantial memory in a big data application. You can sort the DataFrame in place, rather than copying the data. To do so, pass the keyword argument inplace=True to either sort_index or sort_values.
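The difference between the two modes can be sketched as follows. The example assumes the grades data shown earlier in this section, transposed so that students are rows; by_student, sorted_copy, original_order and result are illustrative names.

```python
import pandas as pd

# Grades rebuilt from this section's outputs, with students as rows (assumed construction).
by_student = pd.DataFrame(
    {'Wally': [87, 96, 70], 'Eva': [100, 87, 90], 'Sam': [94, 77, 90],
     'Katie': [100, 81, 82], 'Bob': [83, 65, 85]},
    index=['Test1', 'Test2', 'Test3']).T.copy()

# Default behavior: a sorted copy is returned and the original is left alone.
sorted_copy = by_student.sort_values(by='Test1', ascending=False)
original_order = by_student.index.tolist()

# In-place behavior: the DataFrame itself is reordered and the call returns None.
result = by_student.sort_values(by='Test1', ascending=False, inplace=True)

print(original_order)
print(by_student['Test1'].tolist(), result)
```

Because an in-place call returns None, assigning its result back to the variable would discard the DataFrame.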


We've shown many pandas Series and DataFrame features. In the next chapter’s 
Intro to Data Science section, we'll use Series and DataFrames for data munging— 


cleaning and preparing data for use in your database or analytics software. 


7.15 WRAP-UP 


This chapter explored the use of NumPy’s high-performance ndarrays for storing and 
retrieving data, and for performing common data manipulations concisely and with 
reduced chance of errors with functional-style programming. We refer to ndarrays 


simply by their synonym, arrays. 


The chapter examples demonstrated how to create, initialize and refer to individual elements of one- and two-dimensional arrays. We used attributes to determine an array’s size, shape and element type. We showed functions that create arrays of 0s, 1s, specific values or ranges of values. We compared list and array performance with the IPython %timeit magic and saw that arrays are up to two orders of magnitude faster.


We used array operators and NumPy universal functions to perform element-wise 
calculations on every element of arrays that have the same shape. You also saw that 
NumPy uses broadcasting to perform element-wise operations between arrays and 
scalar values, and between arrays of different shapes. We introduced various built-in 
array methods for performing calculations using all elements of an array, and we 
showed how to perform those calculations row-by-row or column-by-column. We 
demonstrated various array slicing and indexing capabilities that are more powerful 
than those provided by Python’s built-in collections. We demonstrated various ways to 
reshape arrays. We discussed how to shallow copy and deep copy arrays and other 
Python objects. 


In the Intro to Data Science section, we began our multisection introduction to the 
popular pandas library that you'll use in many of the data science case study chapters. 
You saw that many big data applications need more flexible collections than NumPy’s 
arrays, collections that support mixed data types, custom indexing, missing data, data 
that’s not structured consistently and data that needs to be manipulated into forms 


appropriate for the databases and data analysis packages you use. 


We showed how to create and manipulate pandas array-like one-dimensional Series 
and two-dimensional DataFrames. We customized Series and DataFrame indices. 
You saw pandas’ nicely formatted outputs and customized the precision of floating- 
point values. We showed various ways to access and select data in Series and 
DataFrames. We used method describe to calculate basic descriptive statistics for 


Series and DataFrames. We showed how to transpose DataFrame rows and 





columns via the T attribute. You saw several ways to sort DataFrames using their index 
values, their column names, the data in their rows and the data in their columns. You’re 
now familiar with four powerful array-like collections—lists, arrays, Series and 

DataFrames—and the contexts in which to use them. We'll add a fifth—tensors—in the 


“Deep Learning” chapter. 


In the next chapter, we take a deeper look at strings, string formatting and string methods. We also introduce regular expressions, which we'll use to match patterns in text. The capabilities you'll see will help you prepare for the “Natural Language Processing (NLP)” chapter and other key data science chapters. In the next chapter’s Intro to Data Science section, we'll introduce pandas data munging: preparing data for use in your database or analytics software. In subsequent chapters, we'll use pandas for basic time-series analysis and introduce pandas visualization capabilities.




8. Strings: A Deeper Look


Objectives 

In this chapter, you'll: 

• Understand text processing.
• Use string methods.
• Format string content.
• Concatenate and repeat strings.
• Strip whitespace from the ends of strings.
• Change characters from lowercase to uppercase and vice versa.
• Compare strings with the comparison operators.
• Search strings for substrings and replace substrings.
• Split strings into tokens.
• Concatenate strings into a single string with a specified separator between items.
• Create and use regular expressions to match patterns in strings, replace substrings and validate data.
• Use regular expression metacharacters, quantifiers, character classes and grouping.
• Understand how critical string manipulations are to natural language processing.
• Understand the data science terms data munging, data wrangling and data cleaning, and use regular expressions to munge data into preferred formats.


Outline 
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.2.4 String’s format Method 





8.3 Concatenating and Repeating Strings
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.5 Changing Character Case 
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.8 Replacing Substrings 

8.9 Splitting and Joining Strings
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.11 Raw Strings 


8.12 Introduction to Regular Expressions
8.12.1 re Module and Function fullmatch
8.12.2 Replacing Substrings and Splitting Strings
8.12.3 Other Search Functions; Accessing Matches
8.13 Intro to Data Science: Pandas, Regular Expressions and Data Munging
8.14 Wrap-Up


8.1 INTRODUCTION


We've introduced strings, basic string formatting and several string operators and methods. You saw that strings support many of the same sequence operations as lists and tuples, and that strings, like tuples, are immutable. Now, we take a deeper look at strings and introduce regular expressions and the re module, which we'll use to match patterns* in text. Regular expressions are particularly important in today’s data-rich applications. The capabilities presented here will help you prepare for the “Natural Language Processing (NLP)” chapter, where we'll look at other ways to have computers manipulate and even “understand” text. The table below shows many string-processing and NLP-related applications. In the Intro to Data Science section, we briefly introduce data cleaning/munging/wrangling with pandas Series and DataFrames.

* We'll see in the data science case study chapters that searching for patterns in text is a crucial part of machine learning.


String and NLP applications

Anagrams
Automated grading of written homework
Automated teaching systems
Categorizing articles
Compilers and interpreters
Creative writing
Cryptography
Document classification
Document similarity
Document summarization
Electronic book readers
Fraud detection
Grammar checkers
Inter-language translation
Legal document preparation
Monitoring social media posts
Natural language understanding
Opinion analysis
Page-composition software
Palindromes
Parts-of-speech tagging
Project Gutenberg free books
Reading books, articles, documentation and absorbing knowledge
Search engines
Sentiment analysis
Spam classification
Speech-to-text engines
Spell checkers
Steganography
Text editors
Text-to-speech engines
Web scraping
Who authored Shakespeare’s works?
Word clouds
Word games
Writing medical diagnoses from x-rays, scans, blood tests
and many more


8.2 FORMATTING STRINGS 


Proper text formatting makes data easier to read and understand. Here, we present 


many text-formatting capabilities. 


8.2.1 Presentation Types 


You’ve seen basic string formatting with f-strings. When you specify a placeholder for a 
value in an f-string, Python assumes the value should be displayed as a string unless 
you specify another type. In some cases, the type is required. For example, let’s format 
the float value 17.489 rounded to the hundredths position: 





In [1]: f'{17.489:.2f}'
Out[1]: '17.49'

Python supports precision only for floating-point and Decimal values. Formatting is type dependent: if you try to use .2f to format a string like 'hello', a ValueError occurs. So the presentation type f in the format specifier .2f is required. It indicates what type is being formatted so Python can determine whether the other formatting information is allowed for that type. Here, we show some common presentation types. You can view the complete list at

https://docs.python.org/3/library/string.html#formatspec
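To see the type dependence in action, here is a small sketch that applies the same format specifier to a float and to a string. It uses the built-in format function, which applies a format specifier to a single value; try_format is a hypothetical helper introduced only for this illustration.

```python
def try_format(value, spec):
    """Return value formatted with spec, or the error text if spec is invalid for it."""
    try:
        return format(value, spec)
    except ValueError as error:
        return f'ValueError: {error}'

number_result = try_format(17.489, '.2f')   # f is valid for a float
string_result = try_format('hello', '.2f')  # f is not valid for a str

print(number_result)
print(string_result)
```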


Integers 


The d presentation type formats integer values as strings: 


In [2]: f'{10:d}'
Out[2]: '10'


There also are integer presentation types (b, o and x or X) that format integers using 


the binary, octal or hexadecimal number systems. * 


* See the online appendix Number Systems for information about the binary, octal and 


hexadecimal number systems. 
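As a brief illustration of the b, o, x and X presentation types, the sketch below formats the same integer in each number system. The value 255 and the variable names are chosen only for this example.

```python
value = 255

binary = f'{value:b}'     # base 2
octal = f'{value:o}'      # base 8
hex_lower = f'{value:x}'  # base 16 with lowercase digits
hex_upper = f'{value:X}'  # base 16 with uppercase digits

print(binary, octal, hex_lower, hex_upper)

# int with an explicit base converts such a string back to an integer.
round_trip = int(binary, 2)
```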


Characters 


The c presentation type formats an integer character code as the corresponding 


character: 


In [3]: f'{65:c} {97:c}'
Out[3]: 'A a'


Strings 

The s presentation type is the default. If you specify s explicitly, the value to format 
must be a variable that references a string, an expression that produces a string or a 
string literal, as in the first placeholder below. If you do not specify a presentation type, 
as in the second placeholder below, non-string values like the integer 7 are converted to 


strings: 


In [4]: f'{"hello":s} {7}'
Out[4]: 'hello 7'


In this snippet, "hello" is enclosed in double quotes. Recall that you cannot place 


single quotes inside a single-quoted string. 


Floating-Point and Decimal Values 


You've used the f presentation type to format floating-point and Decimal values. For extremely large and small values of these types, exponential (scientific) notation can be used to format the values more compactly. Let’s show the difference between f and e for a large value, each with three digits of precision to the right of the decimal point:

In [5]: from decimal import Decimal

In [6]: f'{Decimal("10000000000000000000000000.0"):.3f}'
Out[6]: '10000000000000000000000000.000'

In [7]: f'{Decimal("10000000000000000000000000.0"):.3e}'
Out[7]: '1.000e+25'

For the e presentation type in snippet [7], the formatted value 1.000e+25 is equivalent to

$1.000 \times 10^{25}$

If you prefer a capital E for the exponent, use the E presentation type rather than e.


8.2.2 Field Widths and Alignment 


Previously you used field widths to format text in a specified number of character positions. By default, Python right-aligns numbers and left-aligns other values such as strings. We enclose the results below in brackets ([]) so you can see how the values align in the field:

In [1]: f'[{27:10d}]'
Out[1]: '[        27]'

In [2]: f'[{3.5:10f}]'
Out[2]: '[  3.500000]'

In [3]: f'[{"hello":10}]'
Out[3]: '[hello     ]'


Snippet [2] shows that Python formats float values with six digits of precision to the 





right of the decimal point by default. For values that have fewer characters than the 
field width, the remaining character positions are filled with spaces. Values with more 


characters than the field width use as many character positions as they need. 


Explicitly Specifying Left and Right Alignment in a Field 


Recall that you can specify left and right alignment with < and >: 


In [4]: f'[{27:<15d}]'
Out[4]: '[27             ]'

In [5]: f'[{3.5:<15f}]'
Out[5]: '[3.500000       ]'

In [6]: f'[{"hello":>15}]'
Out[6]: '[          hello]'


Centering a Value in a Field 


In addition, you can center values: 


In [7]: f'[{27:^7d}]'
Out[7]: '[  27   ]'

In [8]: f'[{3.5:^7.1f}]'
Out[8]: '[  3.5  ]'

In [9]: f'[{"hello":^7}]'
Out[9]: '[ hello ]'





Centering attempts to spread the remaining unoccupied character positions equally to 
the left and right of the formatted value. Python places the extra space to the right if an 


odd number of character positions remain. 
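The rule for leftover positions can be checked with a short sketch that centers the same five-character string in fields of width 9 and 8. The variable names are illustrative.

```python
# Width 9 leaves 4 positions, which split evenly: 2 on each side.
even_padding = f'[{"hello":^9}]'

# Width 8 leaves 3 positions: 1 on the left and the extra one on the right.
odd_padding = f'[{"hello":^8}]'

print(even_padding)
print(odd_padding)
```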


8.2.3 Numeric Formatting 


There are a variety of numeric formatting capabilities. 


Formatting Positive Numbers with Signs 


Sometimes it’s desirable to force the sign on a positive number: 


In [1]: f'[{27:+10d}]'
Out[1]: '[       +27]'

The + before the field width specifies that a positive number should be preceded by a +. A negative number always starts with a -. To fill the remaining characters of the field with 0s rather than spaces, place a 0 before the field width (and after the + if there is one):

In [2]: f'[{27:+010d}]'
Out[2]: '[+000000027]'


Using a Space Where a + Sign Would Appear in a Positive Value 


A space indicates that positive numbers should show a space character in the sign 


position. This is useful for aligning positive and negative values for display purposes: 


In [3]: print(f'{27:d}\n{27: d}\n{-27: d}')
27
 27
-27

Note that the two numbers with a space in their format specifiers align. If a field width is specified, the space should appear before the field width.

Grouping Digits

You can format numbers with thousands separators by using a comma (,), as follows:

In [4]: f'{12345678:,d}'
Out[4]: '12,345,678'

In [5]: f'{123456.78:,.2f}'
Out[5]: '123,456.78'

8.2.4 String’s format Method 


Python’s f-strings were added to the language in version 3.6. Before that, formatting was performed with the string method format. In fact, f-string formatting is based on the format method’s capabilities. We show you the format method here because you’ll encounter it in code written prior to Python 3.6. You'll often see the format method in the Python documentation and in the many Python books and articles written before f-strings were introduced. However, we recommend using the newer f-string formatting that we’ve presented to this point.


You call method format on a format string containing curly brace ({}) placeholders, possibly with format specifiers. You pass to the method the values to be formatted. Let’s format the float value 17.489 rounded to the hundredths position:

In [1]: '{:.2f}'.format(17.489)
Out[1]: '17.49'

In a placeholder, if there’s a format specifier, you precede it by a colon (:), as in f-strings. The result of the format call is a new string containing the formatted results.





Multiple Placeholders 





A format string may contain multiple placeholders, in which case the format method’s 


arguments correspond to the placeholders from left to right: 


In [2]: '{} {}'.format('Amanda', 'Cyan')
Out[2]: 'Amanda Cyan'


Referencing Arguments By Position Number 





The format string can reference specific arguments by their position in the format 


method’s argument list, starting with position 0: 


In [3]: '{0} {0} {1}'.format('Happy', 'Birthday')
Out[3]: 'Happy Happy Birthday'

Note that we used the position number 0 ('Happy') twice. You can reference each argument as often as you like and in any order.


Referencing Keyword Arguments 


You can reference keyword arguments by their keys in the placeholders: 


In [4]: '{first} {last}'.format(first='Amanda', last='Gray')
Out[4]: 'Amanda Gray'

In [5]: '{last} {first}'.format(first='Amanda', last='Gray')
Out[5]: 'Gray Amanda'
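Keyword placeholders pair naturally with dictionaries. As a small sketch, the code below unpacks a dictionary into format's keyword arguments with ** and produces the same text with an f-string for comparison. The person dictionary is invented for this example.

```python
# Hypothetical record used only for illustration.
person = {'first': 'Amanda', 'last': 'Gray'}

# ** unpacks the dictionary into the keyword arguments first=... and last=...
with_format = '{last}, {first}'.format(**person)

# The equivalent f-string reads the values from local variables.
first, last = person['first'], person['last']
with_fstring = f'{last}, {first}'

print(with_format)
print(with_fstring)
```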


8.3 CONCATENATING AND REPEATING STRINGS 


In earlier chapters, we used the + operator to concatenate strings and the * operator to 
repeat strings. You also can perform these operations with augmented assignments. 


Strings are immutable, so each operation assigns a new string object to the variable: 


In [1]: s1 = 'happy'

In [2]: s2 = 'birthday'

In [3]: s1 += ' ' + s2

In [4]: s1
Out[4]: 'happy birthday'

In [5]: symbol = '>'

In [6]: symbol *= 5

In [7]: symbol
Out[7]: '>>>>>'
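One way to observe that an augmented assignment creates a new string rather than changing the existing one is to keep a second reference to the original object, as in this short sketch (alias is an illustrative name):

```python
s1 = 'happy'
alias = s1          # second reference to the same string object

s1 += ' birthday'   # builds a new string and rebinds s1 to it

print(s1)     # the new, longer string
print(alias)  # still the original string
```

If strings were mutable, alias would have changed along with s1.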





8.4 STRIPPING WHITESPACE FROM STRINGS 


There are several string methods for removing whitespace from the ends of a string. 
Each returns a new string leaving the original unmodified. Strings are immutable, so 


each method that appears to modify a string returns a new one. 


Removing Leading and Trailing Whitespace 


Let’s use string method strip to remove the leading and trailing whitespace from a 


string: 


In [1]: sentence = '\t  \n  This is a test string. \t\t \n'

In [2]: sentence.strip()
Out[2]: 'This is a test string.'

Removing Leading Whitespace

Method lstrip removes only leading whitespace:

In [3]: sentence.lstrip()
Out[3]: 'This is a test string. \t\t \n'

Removing Trailing Whitespace

Method rstrip removes only trailing whitespace:

In [4]: sentence.rstrip()
Out[4]: '\t  \n  This is a test string.'


As the outputs demonstrate, these methods remove all kinds of whitespace, including 


spaces, newlines and tabs. 
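A typical use of these methods is cleaning up lines of text before processing them. The following sketch uses a made-up list of strings to show all three methods side by side.

```python
# Sample lines with assorted leading and trailing whitespace (invented data).
raw_lines = ['  alpha\n', '\tbeta  ', '\n gamma \t\n']

stripped = [line.strip() for line in raw_lines]  # both ends cleaned
left_only = raw_lines[1].lstrip()                # trailing spaces remain
right_only = raw_lines[1].rstrip()               # leading tab remains

print(stripped)
print(repr(left_only), repr(right_only))
```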


8.5 CHANGING CHARACTER CASE 


In earlier chapters, you used string methods lower and upper to convert strings to all 


lowercase or all uppercase letters. You also can change a string’s capitalization with 


methods capitalize and title. 


Capitalizing Only a String’s First Character 


Method capitalize copies the original string and returns a new string with only the 


first letter capitalized (this is sometimes called sentence capitalization): 


In [1]: 'happy birthday'.capitalize()
Out[1]: 'Happy birthday'


Capitalizing the First Character of Every Word in a String 


Method title copies the original string and returns a new string with only the first 


character of each word capitalized (this is sometimes called book-title capitalization): 




In [2]: 'strings: a deeper look'.title() 
Out[2]: 'Strings: A Deeper Look' 


8.6 COMPARISON OPERATORS FOR STRINGS 


Strings may be compared with the comparison operators. Recall that strings are 
compared based on their underlying integer numeric values. So uppercase letters 
compare as less than lowercase letters because uppercase letters have lower integer 
values. For example, 'A' is 65 and 'a' is 97. You’ve seen that you can check character 


codes with ord: 


In [1]: print(f'A: {ord("A")}; a: {ord("a")}')
A: 65; a: 97


Let’s compare the strings 'Orange' and 'orange' using the comparison operators: 


In [2]: 'Orange' == 'orange'
Out[2]: False

In [3]: 'Orange' != 'orange'
Out[3]: True

In [4]: 'Orange' < 'orange'
Out[4]: True

In [5]: 'Orange' <= 'orange'
Out[5]: True

In [6]: 'Orange' > 'orange'
Out[6]: False

In [7]: 'Orange' >= 'orange'
Out[7]: False
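Because uppercase letters compare as less than lowercase letters, a plain sort places every capitalized word first. A common workaround, sketched below with an invented list, is to compare lowercase copies of the strings.

```python
# Sample words chosen only for this illustration.
fruits = ['orange', 'Banana', 'apple', 'Orange']

default_order = sorted(fruits)                    # compares character codes
case_insensitive = sorted(fruits, key=str.lower)  # compares lowercase copies

same_ignoring_case = 'Orange'.lower() == 'orange'.lower()

print(default_order)
print(case_insensitive)
print(same_ignoring_case)
```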





8.7 SEARCHING FOR SUBSTRINGS 


You can search in a string for one or more adjacent characters—known as a substring— 
to count the number of occurrences, determine whether a string contains a substring, 
or determine the index at which a substring resides in a string. Each method shown in 
this section compares characters lexicographically using their underlying numeric 


values. 


Counting Occurrences 


String method count returns the number of times its argument occurs in the string on 
which the method is called: 


In [1]: sentence = 'to be or not to be that is the question'

In [2]: sentence.count('to')
Out[2]: 2

If you specify as the second argument a start_index, count searches only the slice string[start_index:], that is, from start_index through the end of the string:

In [3]: sentence.count('to', 12)
Out[3]: 1

If you specify as the second and third arguments the start_index and end_index, count searches only the slice string[start_index:end_index], that is, from start_index up to, but not including, end_index:

In [4]: sentence.count('that', 12, 25)
Out[4]: 1


Like count, each of the other string methods presented in this section has start_index 


and end_index arguments for searching only a slice of the original string. 


Locating a Substring in a String 


String method index searches for a substring within a string and returns the first 


index at which the substring is found; otherwise, a ValueError occurs: 


In [5]: sentence.index('be')
Out[5]: 3

String method rindex performs the same operation as index, but searches from the end of the string and returns the last index at which the substring is found; otherwise, a ValueError occurs:

In [6]: sentence.rindex('be')
Out[6]: 16

String methods find and rfind perform the same tasks as index and rindex but, if the substring is not found, return -1 rather than causing a ValueError.
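The contrast between the two families of methods can be sketched as follows, using the sentence from this section and a substring ('maybe') that does not occur in it. The variable names are illustrative.

```python
sentence = 'to be or not to be that is the question'

first_be = sentence.find('be')    # same result as index when the substring exists
last_be = sentence.rfind('be')    # same result as rindex when the substring exists
missing = sentence.find('maybe')  # -1 signals that nothing was found

# index raises an exception for the same missing substring.
try:
    sentence.index('maybe')
    index_raised = False
except ValueError:
    index_raised = True

print(first_be, last_be, missing, index_raised)
```

Use find when a missing substring is an expected outcome, and index when it indicates a problem that should not pass silently.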


Determining Whether a String Contains a Substring 


If you need to know only whether a string contains a substring, use operator in or not 


in: 


In [7]: 'that' in sentence
Out[7]: True

In [8]: 'THAT' in sentence
Out[8]: False

In [9]: 'THAT' not in sentence
Out[9]: True





Locating a Substring at the Beginning or End of a String 


String methods startswith and endswith return True if the string starts with or 


ends with a specified substring: 


In [10]: sentence.startswith('to')
Out[10]: True

In [11]: sentence.startswith('be')
Out[11]: False

In [12]: sentence.endswith('question')
Out[12]: True

In [13]: sentence.endswith('quest')
Out[13]: False





8.8 REPLACING SUBSTRINGS 


A common text manipulation is to locate a substring and replace its value. Method 
replace takes two substrings. It searches a string for the substring in its first 
argument and replaces each occurrence with the substring in its second argument. The 
method returns a new string containing the results. Let’s replace tab characters with 


commas: 


In [1]: values = '1\t2\t3\t4\t5'

In [2]: values.replace('\t', ',')
Out[2]: '1,2,3,4,5'

Method replace can receive an optional third argument specifying the maximum 


number of replacements to perform. 
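As a short sketch of the optional third argument, the code below replaces only the first two tab characters in a tab-separated string and leaves the rest untouched.

```python
values = '1\t2\t3\t4\t5'

all_replaced = values.replace('\t', ',')  # every tab becomes a comma
first_two = values.replace('\t', ',', 2)  # at most two replacements, left to right

print(all_replaced)
print(repr(first_two))
```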


8.9 SPLITTING AND JOINING STRINGS 


When you read a sentence, your brain breaks it into individual words, or tokens, each of which conveys meaning. Interpreters like IPython tokenize statements, breaking them into individual components such as keywords, identifiers, operators and other elements of a programming language. Tokens typically are separated by whitespace characters such as blank, tab and newline, though other characters may be used. The separators are known as delimiters.


Splitting Strings 
We showed previously that string method split with no arguments tokenizes a string by breaking it into substrings at each whitespace character, then returns a list of tokens. To tokenize a string at a custom delimiter (such as each comma-and-space pair), specify the delimiter string (such as ', ') that split uses to tokenize the string:

In [1]: letters = 'A, B, C, D'

In [2]: letters.split(', ')
Out[2]: ['A', 'B', 'C', 'D']

If you provide an integer as the second argument, it specifies the maximum number of splits. The last token is the remainder of the string after the maximum number of splits:

In [3]: letters.split(', ', 2)
Out[3]: ['A', 'B', 'C, D']

There is also an rsplit method that performs the same task as split but processes 


the maximum number of splits from the end of the string toward the beginning. 
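The difference between the two methods shows up only when the number of splits is limited, as this small sketch illustrates:

```python
letters = 'A, B, C, D'

from_left = letters.split(', ', 2)    # the unsplit remainder is the last item
from_right = letters.rsplit(', ', 2)  # the unsplit remainder is the first item

print(from_left)
print(from_right)
```

Without a limit, split and rsplit return the same list.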


Joining Strings 
String method join concatenates the strings in its argument, which must be an 


iterable containing only string values; otherwise, a TypeError occurs. The separator 


between the concatenated items is the string on which you call join. The following 


code creates strings containing comma-separated lists of values: 


In [4]: letters_list = ['A', 'B', 'C', 'D']

In [5]: ','.join(letters_list)
Out[5]: 'A,B,C,D'

The next snippet joins the results of a list comprehension that creates a list of strings:

In [6]: ','.join([str(i) for i in range(10)])
Out[6]: '0,1,2,3,4,5,6,7,8,9'


In the “Files and Exceptions” chapter, you'll see how to work with files that contain 
comma-separated values. These are known as CSV files and are a common format for 
storing data that can be loaded by spreadsheet applications like Microsoft Excel or 
Google Sheets. In the data science case study chapters, you'll see that many key 
libraries, such as NumPy, Pandas and Seaborn, provide built-in capabilities for working 
with CSV data. 


String Methods partition and rpartition


String method partition splits a string into a tuple of three strings based on the
method’s separator argument. The three strings are

• the part of the original string before the separator,
• the separator itself, and
• the part of the string after the separator.


This might be useful for splitting more complex strings. Consider a string representing
a student’s name and grades:

'Amanda: 89, 97, 92'

Let’s split the original string into the student’s name, the separator ': ' and a string
representing the list of grades:

In [7]: 'Amanda: 89, 97, 92'.partition(': ')
Out[7]: ('Amanda', ': ', '89, 97, 92')


To search for the separator from the end of the string instead, use method
rpartition to split. For example, consider the following URL string:

'http://www.deitel.com/books/PyCDS/table_of_contents.html'

Let’s use rpartition to split 'table_of_contents.html' from the rest of the URL:

In [8]: url = 'http://www.deitel.com/books/PyCDS/table_of_contents.html'

In [9]: rest_of_url, separator, document = url.rpartition('/')

In [10]: document
Out[10]: 'table_of_contents.html'

In [11]: rest_of_url
Out[11]: 'http://www.deitel.com/books/PyCDS'
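One detail worth checking for yourself is what the two methods return when the separator occurs several times or not at all. The sketch below uses made-up strings; it is an illustration rather than part of the preceding session.

```python
path = 'books/PyCDS/index.html'  # hypothetical path with two separators

first_split = path.partition('/')   # splits at the first '/'
last_split = path.rpartition('/')   # splits at the last '/'

print(first_split)  # ('books', '/', 'PyCDS/index.html')
print(last_split)   # ('books/PyCDS', '/', 'index.html')

# When the separator is absent, the whole string lands in the first
# element for partition and in the last element for rpartition.
missing_first = 'Amanda'.partition(': ')
missing_last = 'Amanda'.rpartition(': ')

print(missing_first)  # ('Amanda', '', '')
print(missing_last)   # ('', '', 'Amanda')
```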








String Method splitlines 


In the “Files and Exceptions” chapter, you'll read text from a file. If you read large
amounts of text into a string, you might want to split the string into a list of lines based
on newline characters. Method splitlines returns a list of new strings representing
the lines of text split at each newline character in the original string. Recall that Python
stores multiline strings with embedded \n characters to represent the line breaks, as
shown in snippet [13]:

In [12]: lines = """This is line 1
    ...: This is line2
    ...: This is line3"""

In [13]: lines
Out[13]: 'This is line 1\nThis is line2\nThis is line3'

In [14]: lines.splitlines()
Out[14]: ['This is line 1', 'This is line2', 'This is line3']


Passing True to splitlines keeps the newlines at the end of each string:

In [15]: lines.splitlines(True)
Out[15]: ['This is line 1\n', 'This is line2\n', 'This is line3']


8.10 CHARACTERS AND CHARACTER-TESTING METHODS


Many programming languages have separate string and character types. In Python, a
character is simply a one-character string.

Python provides string methods for testing whether a string matches certain
characteristics. For example, string method isdigit returns True if the string on
which you call the method contains only the digit characters (0-9). You might use this
when validating user input that must contain only digits:

In [1]: '-27'.isdigit()
Out[1]: False

In [2]: '27'.isdigit()
Out[2]: True


and the string method isalnum returns True if the string on which you call the
method is alphanumeric, that is, it contains only digits and letters:

In [3]: 'A9876'.isalnum()
Out[3]: True

In [4]: '123 Main Street'.isalnum()
Out[4]: False


The table below shows many of the character-testing methods. Each method returns
False if the condition described is not satisfied:

String Method     Description

isalnum()         Returns True if the string contains only alphanumeric
                  characters (i.e., digits and letters).

isalpha()         Returns True if the string contains only alphabetic
                  characters (i.e., letters).

isdecimal()       Returns True if the string contains only decimal integer
                  characters (that is, base 10 integers) and does not
                  contain a + or - sign.

isdigit()         Returns True if the string contains only digits (e.g., '0',
                  '1', '2').

isidentifier()    Returns True if the string represents a valid identifier.

islower()         Returns True if all alphabetic characters in the string are
                  lowercase characters (e.g., 'a', 'b', 'c').

isnumeric()       Returns True if the characters in the string represent a
                  numeric value without a + or - sign and without a
                  decimal point.

isspace()         Returns True if the string contains only whitespace
                  characters.

istitle()         Returns True if the first character of each word in the
                  string is the only uppercase character in the word.

isupper()         Returns True if all alphabetic characters in the string are
                  uppercase characters (e.g., 'A', 'B', 'C').
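The three digit-related methods agree for ordinary ASCII digits but can differ for other Unicode characters. The following sketch, which uses sample strings chosen only for illustration, prints the result of each method side by side.

```python
# Sample strings: plain digits, a signed number, SUPERSCRIPT TWO and
# VULGAR FRACTION ONE HALF (written with escapes to avoid encoding issues).
samples = ['27', '-27', '\u00b2', '\u00bd']

results = {}
for text in samples:
    results[text] = (text.isdecimal(), text.isdigit(), text.isnumeric())
    print(repr(text), results[text])
```

For validating ordinary user input such as '27', any of the three works; the sign in '-27' makes all three return False.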


8.11 RAW STRINGS 


Recall that backslash characters in strings introduce escape sequences, like \n for
newline and \t for tab. So, if you wish to include a backslash in a string, you must use
two backslash characters \\. This makes some strings difficult to read. For example,
Microsoft Windows uses backslashes to separate folder names when specifying a file’s
location. To represent a file’s location on Windows, you might write:

In [1]: file_path = 'C:\\MyFolder\\MySubFolder\\MyFile.txt'

In [2]: file_path
Out[2]: 'C:\\MyFolder\\MySubFolder\\MyFile.txt'


For such cases, raw strings (preceded by the character r) are more convenient. They
treat each backslash as a regular character, rather than the beginning of an escape
sequence:

In [3]: file_path = r'C:\MyFolder\MySubFolder\MyFile.txt'

In [4]: file_path
Out[4]: 'C:\\MyFolder\\MySubFolder\\MyFile.txt'


A raw string produces an ordinary string object. As the last snippet shows, IPython
still displays each backslash as two backslash characters, because that is how a
backslash is written in a regular string literal. Raw strings can make your code more
readable, particularly when using the regular expressions that we discuss in the next
section. Regular expressions often contain many backslash characters.
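A quick way to convince yourself that the two notations produce the same string is to compare them directly. This sketch uses a shortened, made-up path.

```python
regular = 'C:\\MyFolder\\MyFile.txt'  # backslashes escaped by hand
raw = r'C:\MyFolder\MyFile.txt'       # raw string, no escaping needed

print(regular == raw)   # True
print(raw)              # C:\MyFolder\MyFile.txt
print(raw.count('\\'))  # 2
```

print shows the string's actual characters, whereas the interactive Out[] display shows its repr, in which each backslash is doubled.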


8.12 INTRODUCTION TO REGULAR EXPRESSIONS


Sometimes you'll need to recognize patterns in text, like phone numbers, e-mail
addresses, ZIP Codes, web page addresses, Social Security numbers and more. A
regular expression string describes a search pattern for matching characters in
other strings.

Regular expressions can help you extract data from unstructured text, such as social
media posts. They’re also important for ensuring that data is in the correct format
before you attempt to process it. 3

3 The topic of regular expressions might feel more challenging than most other Python
features you've used. After mastering this subject, you'll often write more concise code
than with conventional string-processing techniques, speeding the code-development
process. You'll also deal with fringe cases you might not ordinarily think about, possibly
avoiding subtle bugs.


Validating Data 


Before working with text data, you'll often use regular expressions to validate the data.
For example, you can check that:

• A U.S. ZIP Code consists of five digits (such as 02215) or five digits followed by a
hyphen and four more digits (such as 02215-4775).
• A last name contains only letters, spaces, apostrophes and hyphens.
• An e-mail address contains only the allowed characters in the allowed order.
• A U.S. Social Security number contains three digits, a hyphen, two digits, a hyphen
and four digits, and adheres to other rules about the specific numbers that can be
used in each group of digits.
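As a preview of what such a check can look like, here is a minimal sketch for the ZIP Code rule in the first bullet. The names ZIP_PATTERN and is_valid_zip are hypothetical, and the pattern checks only the format described above, not whether the ZIP Code actually exists.

```python
import re

# Five digits, optionally followed by a hyphen and four more digits.
ZIP_PATTERN = r'\d{5}(-\d{4})?'


def is_valid_zip(text):
    """Return True if text has the form ##### or #####-####."""
    return re.fullmatch(ZIP_PATTERN, text) is not None


print(is_valid_zip('02215'))       # True
print(is_valid_zip('02215-4775'))  # True
print(is_valid_zip('0221'))        # False
print(is_valid_zip('02215-47'))    # False
```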


You'll rarely need to create your own regular expressions for common items like these.
Websites like

• https://regex101.com
• http://www.regexlib.com
• https://www.regular-expressions.info

and others offer repositories of existing regular expressions that you can copy and use.
Many sites like these also provide interfaces in which you can test regular expressions
to determine whether they'll meet your needs.


Other Uses of Regular Expressions 


In addition to validating data, regular expressions often are used to:

• Extract data from text (sometimes known as scraping). For example, locating all
URLs in a web page. [You might prefer tools like BeautifulSoup, XPath and lxml.]
• Clean data. For example, removing data that’s not required, removing duplicate
data, handling incomplete data, fixing typos, ensuring consistent data formats,
dealing with outliers and more.
• Transform data into other formats. For example, reformatting data that was
collected as tab-separated or space-separated values into comma-separated values
(CSV) for an application that requires data to be in CSV format.


8.12.1 re Module and Function fullmatch

To use regular expressions, import the Python Standard Library’s re module:

In [1]: import re


One of the simplest regular expression functions is fullmatch, which checks whether 


the entire string in its second argument matches the pattern in its first argument. 


Matching Literal Characters 


Let’s begin by matching literal characters—that is, characters that match themselves: 


In [2]: pattern = '02215'

In [3]: 'Match' if re.fullmatch(pattern, '02215') else 'No match'
Out[3]: 'Match'

In [4]: 'Match' if re.fullmatch(pattern, '51220') else 'No match'
Out[4]: 'No match'








The function’s first argument is the regular expression pattern to match. Any string can 
be a regular expression. The variable pattern’s value, '02215', contains only literal 
digits that match themselves in the specified order. The second argument is the string 


that should entirely match the pattern. 


If the second argument matches the pattern in the first argument, fullmatch returns
an object containing the matching text, which evaluates to True. We'll say more about
this object later. In snippet [4], even though the second argument contains the same
digits as the regular expression, they’re in a different order. So there’s no match, and
fullmatch returns None, which evaluates to False.
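The following sketch makes the two possible return values explicit. It reuses the literal pattern '02215'; the variable names match and no_match are illustrative.

```python
import re

pattern = '02215'

match = re.fullmatch(pattern, '02215')     # a match object
no_match = re.fullmatch(pattern, '51220')  # None

print(bool(match), match.group())  # True 02215
print(no_match is None)            # True

# Because None is falsy, an ordinary if statement works as the check.
if match:
    print('Match')
```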


Metacharacters, Character Classes and Quantifiers 


Regular expressions typically contain various special symbols called metacharacters,
which are shown in the table below:

Regular expression metacharacters

[]  {}  ()  \  *  +  ^  $  ?  .  |


The \ metacharacter begins each of the predefined character classes, each
matching a specific set of characters. Let’s validate a five-digit ZIP Code:

In [5]: 'Valid' if re.fullmatch(r'\d{5}', '02215') else 'Invalid'
Out[5]: 'Valid'

In [6]: 'Valid' if re.fullmatch(r'\d{5}', '9876') else 'Invalid'
Out[6]: 'Invalid'


In the regular expression \d{5}, \d is a character class representing a digit (0-9). A
character class is a regular expression escape sequence that matches one character. To
match more than one, follow the character class with a quantifier. The quantifier {5}
repeats \d five times, as if we had written \d\d\d\d\d, to match five consecutive
digits. In snippet [6], fullmatch returns None because '9876' contains only four
consecutive digit characters.


Other Predefined Character Classes 


The table below shows some common predefined character classes and the groups of
characters they match. To match any metacharacter as its literal value, precede it by a
backslash (\). For example, \\ matches a backslash (\) and \$ matches a dollar sign
($).

Character class   Matches

\d                Any digit (0-9).
\D                Any character that is not a digit.
\s                Any whitespace character (such as spaces, tabs and newlines).
\S                Any character that is not a whitespace character.
\w                Any word character (also called an alphanumeric
                  character), that is, any uppercase or lowercase letter, any
                  digit or an underscore.
\W                Any character that is not a word character.
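To see escaped metacharacters and predefined character classes working together, consider this small sketch that checks a made-up price format. The name price_pattern and the sample strings are assumptions for illustration.

```python
import re

# \$ is a literal dollar sign, \d+ is one or more digits,
# \. is a literal dot and \d\d is exactly two digits.
price_pattern = r'\$\d+\.\d\d'

valid = re.fullmatch(price_pattern, '$19.99')
invalid = re.fullmatch(price_pattern, '$19x99')

print(bool(valid))    # True
print(bool(invalid))  # False
```

Without the backslash, the dot would match any character, so '$19x99' would be accepted.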


Custom Character Classes 


Square brackets, [], define a custom character class that matches a single
character. For example, [aeiou] matches a lowercase vowel, [A-Z] matches an
uppercase letter, [a-z] matches a lowercase letter and [a-zA-Z] matches any
lowercase or uppercase letter.

Let’s validate a simple first name with no spaces or punctuation. We'll ensure that it
begins with an uppercase letter (A-Z) followed by any number of lowercase letters
(a-z):

In [7]: 'Valid' if re.fullmatch('[A-Z][a-z]*', 'Wally') else 'Invalid'
Out[7]: 'Valid'

In [8]: 'Valid' if re.fullmatch('[A-Z][a-z]*', 'eva') else 'Invalid'
Out[8]: 'Invalid'








A first name might contain many letters. The * quantifier matches zero or more
occurrences of the subexpression to its left (in this case, [a-z]). So [A-Z][a-z]*
matches an uppercase letter followed by zero or more lowercase letters, such as
'Amanda', 'Bo' or even 'E'.


When a custom character class starts with a caret (^), the class matches any character
that’s not specified. So [^a-z] matches any character that’s not a lowercase letter:

In [9]: 'Match' if re.fullmatch('[^a-z]', 'A') else 'No match'
Out[9]: 'Match'

In [10]: 'Match' if re.fullmatch('[^a-z]', 'a') else 'No match'
Out[10]: 'No match'


Metacharacters in a custom character class are treated as literal characters, that is, the
characters themselves. So [*+$] matches a single *, + or $ character:

In [11]: 'Match' if re.fullmatch('[*+$]', '*') else 'No match'
Out[11]: 'Match'

In [12]: 'Match' if re.fullmatch('[*+$]', '!') else 'No match'
Out[12]: 'No match'


* vs. + Quantifier 


If you want to require at least one lowercase letter in a first name, you can replace the *
quantifier in snippet [7] with +, which matches at least one occurrence of a
subexpression:

In [13]: 'Valid' if re.fullmatch('[A-Z][a-z]+', 'Wally') else 'Invalid'
Out[13]: 'Valid'

In [14]: 'Valid' if re.fullmatch('[A-Z][a-z]+', 'E') else 'Invalid'
Out[14]: 'Invalid'











Both * and + are greedy: they match as many characters as possible. So the regular
expression [A-Z][a-z]+ matches 'Al', 'Eva', 'Samantha', 'Benjamin' and any
other words that begin with a capital letter followed by at least one lowercase letter.
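Greediness and the difference between * and + are easier to see with the re module's search function, which finds the first match anywhere in a string. The sample strings in this sketch are made up.

```python
import re

# Greedy: [a-z]* consumes every lowercase letter it can, not just one.
greedy = re.search('[A-Z][a-z]*', 'Samantha Jones').group()
print(greedy)  # Samantha

# With *, a lone capital letter is enough for a match ...
star = re.search('[A-Z][a-z]*', 'E. Jones').group()
print(star)  # E

# ... but + requires at least one lowercase letter, so the first
# match starts at the J instead.
plus = re.search('[A-Z][a-z]+', 'E. Jones').group()
print(plus)  # Jones
```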


Other Quantifiers 


The ? quantifier matches zero or one occurrences of a subexpression:

In [15]: 'Match' if re.fullmatch('labell?ed', 'labelled') else 'No match'
Out[15]: 'Match'

In [16]: 'Match' if re.fullmatch('labell?ed', 'labeled') else 'No match'
Out[16]: 'Match'

In [17]: 'Match' if re.fullmatch('labell?ed', 'labellled') else 'No match'
Out[17]: 'No match'





The regular expression labell?ed matches labelled (the U.K. English spelling) and
labeled (the U.S. English spelling), but not the misspelled word labellled. In each
snippet above, the first five literal characters in the regular expression (label) match
the first five characters of the second arguments. Then l? indicates that there can be
zero or one more l characters before the remaining literal ed characters.


You can match at least n occurrences of a subexpression with the {n,} quantifier.
The following regular expression matches strings containing at least three digits:

In [18]: 'Match' if re.fullmatch(r'\d{3,}', '123') else 'No match'
Out[18]: 'Match'

In [19]: 'Match' if re.fullmatch(r'\d{3,}', '1234567890') else 'No match'
Out[19]: 'Match'

In [20]: 'Match' if re.fullmatch(r'\d{3,}', '12') else 'No match'
Out[20]: 'No match'





You can match between n and m (inclusive) occurrences of a subexpression with the
{n,m} quantifier. The following regular expression matches strings containing 3 to
6 digits:

In [21]: 'Match' if re.fullmatch(r'\d{3,6}', '123') else 'No match'
Out[21]: 'Match'

In [22]: 'Match' if re.fullmatch(r'\d{3,6}', '123456') else 'No match'
Out[22]: 'Match'

In [23]: 'Match' if re.fullmatch(r'\d{3,6}', '1234567') else 'No match'
Out[23]: 'No match'

In [24]: 'Match' if re.fullmatch(r'\d{3,6}', '12') else 'No match'
Out[24]: 'No match'


8.12.2 Replacing Substrings and Splitting Strings 


The re module provides function sub for replacing patterns in a string, and function 


split for breaking a string into pieces, based on patterns. 


Function sub—Replacing Patterns 


By default, the re module’s sub function replaces all occurrences of a pattern with
the replacement text you specify. Let’s convert a tab-delimited string to
comma-delimited:

In [1]: import re

In [2]: re.sub(r'\t', ', ', '1\t2\t3\t4')
Out[2]: '1, 2, 3, 4'


The sub function receives three required arguments:

• the pattern to match (the tab character '\t')
• the replacement text (', ') and
• the string to be searched ('1\t2\t3\t4')

and returns a new string. The keyword argument count can be used to specify the
maximum number of replacements:

In [3]: re.sub(r'\t', ', ', '1\t2\t3\t4', count=2)
Out[3]: '1, 2, 3\t4'


Function split 


The split function tokenizes a string, using a regular expression to specify the
delimiter, and returns a list of strings. Let’s tokenize a string by splitting it at any
comma that’s followed by 0 or more whitespace characters. In the pattern, \s is the
whitespace character class and * indicates zero or more occurrences of the preceding
subexpression:

In [4]: re.split(r',\s*', '1,  2,  3,4,    5,6,7,8')
Out[4]: ['1', '2', '3', '4', '5', '6', '7', '8']


Use the keyword argument maxsplit to specify the maximum number of splits:

In [5]: re.split(r',\s*', '1,  2,  3,4,    5,6,7,8', maxsplit=3)
Out[5]: ['1', '2', '3', '4,    5,6,7,8']


In this case, after the 3 splits, the fourth string contains the rest of the original string. 
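As a sketch of how sub and split relate, the following converts a record whose fields are separated by an inconsistent mix of tabs and spaces into comma-separated form in two equivalent ways. The record string is invented for the example.

```python
import re

record = '1\t2  3 \t 4'  # hypothetical data with mixed whitespace delimiters

# Approach 1: tokenize at each run of whitespace, then join with commas.
fields = re.split(r'\s+', record)
csv_from_split = ','.join(fields)

# Approach 2: replace each run of whitespace with a comma directly.
csv_from_sub = re.sub(r'\s+', ',', record)

print(fields)          # ['1', '2', '3', '4']
print(csv_from_split)  # 1,2,3,4
print(csv_from_sub)    # 1,2,3,4
```

The split version is handy when you also need the individual fields; the sub version is shorter when you only need the reformatted string.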


8.12.3 Other Search Functions; Accessing Matches 


Earlier we used the fullmatch function to determine whether an entire string
matched a regular expression. There are several other searching functions. Here, we
discuss the search, match, findall and finditer functions, and show how to
access the matching substrings.


Function search—Finding the First Match Anywhere in a String 


Function search looks in a string for the first occurrence of a substring that matches a
regular expression and returns a match object (of type SRE_Match) that contains the
matching substring. The match object’s group method returns that substring:

In [1]: import re

In [2]: result = re.search('Python', 'Python is fun')

In [3]: result.group() if result else 'not found'
Out[3]: 'Python'


Function search returns None if the string does not contain the pattern: 


In [4]: result2 = re.search('fun!', 'Python is fun')

In [5]: result2.group() if result2 else 'not found'
Out[5]: 'not found'


You can search for a match only at the beginning of a string with function match. 
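The following sketch contrasts match with search and fullmatch on the same string, so you can see where each one looks for the pattern.

```python
import re

text = 'Python is fun'

at_start = re.match('Python', text)      # pattern is at the beginning
not_at_start = re.match('fun', text)     # 'fun' is not at the beginning
anywhere = re.search('fun', text)        # search looks through the whole string
whole = re.fullmatch('Python', text)     # the entire string must match

print(at_start.group())      # Python
print(not_at_start is None)  # True
print(anywhere.group())      # fun
print(whole is None)         # True
```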


Ignoring Case with the Optional flags Keyword Argument

Many re module functions receive an optional flags keyword argument that changes
how regular expressions are matched. For example, matches are case sensitive by
default, but by using the re module’s IGNORECASE constant, you can perform a
case-insensitive search:

In [6]: result3 = re.search('Sam', 'SAM WHITE', flags=re.IGNORECASE)

In [7]: result3.group() if result3 else 'not found'
Out[7]: 'SAM'


Here, 'SAM' matches the pattern 'Sam' because both have the same letters, even 


though 'SAM' contains only uppercase letters. 


Metacharacters That Restrict Matches to the Beginning or End of a String 


The ^ metacharacter at the beginning of a regular expression (and not inside square
brackets) is an anchor indicating that the expression matches only the beginning of a
string:

In [8]: result = re.search('^Python', 'Python is fun')

In [9]: result.group() if result else 'not found'
Out[9]: 'Python'

In [10]: result = re.search('^fun', 'Python is fun')

In [11]: result.group() if result else 'not found'
Out[11]: 'not found'


Similarly, the $ metacharacter at the end of a regular expression is an anchor
indicating that the expression matches only the end of a string:

In [12]: result = re.search('Python$', 'Python is fun')

In [13]: result.group() if result else 'not found'
Out[13]: 'not found'

In [14]: result = re.search('fun$', 'Python is fun')

In [15]: result.group() if result else 'not found'
Out[15]: 'fun'





Function findall and finditer—Finding All Matches in a String 


Function findall finds every matching substring in a string and returns a list of the
matching substrings. Let’s extract all the U.S. phone numbers from a string. For
simplicity we'll assume that U.S. phone numbers have the form ###-###-####:

In [16]: contact = 'Wally White, Home: 555-555-1234, Work: 555-555-4321'

In [17]: re.findall(r'\d{3}-\d{3}-\d{4}', contact)
Out[17]: ['555-555-1234', '555-555-4321']








Function finditer works like findall, but returns a lazy iterable of match objects.
For large numbers of matches, using finditer can save memory because it returns
one match at a time, whereas findall returns all the matches at once:

In [18]: for phone in re.finditer(r'\d{3}-\d{3}-\d{4}', contact):
    ...:     print(phone.group())
    ...:
555-555-1234
555-555-4321
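Because finditer yields match objects rather than plain strings, you can also ask each match where it was found. The sketch below assumes the same contact string and phone-number pattern as above and uses the match object's start and end methods.

```python
import re

contact = 'Wally White, Home: 555-555-1234, Work: 555-555-4321'
phone_pattern = r'\d{3}-\d{3}-\d{4}'

# Record each phone number with the slice of contact it came from.
found = []
for phone in re.finditer(phone_pattern, contact):
    found.append((phone.group(), phone.start(), phone.end()))

for number, start, end in found:
    # Slicing the original string at start:end gives back the match.
    print(number, contact[start:end] == number)
```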


Capturing Substrings in a Match 


You can use the parentheses metacharacters, ( and ), to capture substrings in a
match. For example, let’s capture as separate substrings the name and e-mail address
in the string text:

In [19]: text = 'Charlie Cyan, e-mail: demo1@deitel.com'

In [20]: pattern = r'([A-Z][a-z]+ [A-Z][a-z]+), e-mail: (\w+@\w+\.\w{3})'

In [21]: result = re.search(pattern, text)








The regular expression specifies two substrings to capture, each denoted by the
metacharacters ( and ). These metacharacters do not affect whether the pattern is
found in the string text; the search function returns a match object only if the entire
pattern is found in the string text.


Let’s consider the regular expression: 


• '([A-Z][a-z]+ [A-Z][a-z]+)' matches two words separated by a space. Each
word must have an initial capital letter.
• ', e-mail: ' contains literal characters that match themselves.
• (\w+@\w+\.\w{3}) matches a simple e-mail address consisting of one or more
alphanumeric characters (\w+), the @ character, one or more alphanumeric
characters (\w+), a dot (\.) and three alphanumeric characters (\w{3}). We
preceded the dot with \ because a dot (.) is a regular expression metacharacter that
matches any one character.


The match object’s groups method returns a tuple of the captured substrings: 


In [22]: result.groups()
Out[22]: ('Charlie Cyan', 'demo1@deitel.com')


The match object’s group method returns the entire match as a single string: 


In [23]: result.group()
Out[23]: 'Charlie Cyan, e-mail: demo1@deitel.com'


You can access each captured substring by passing an integer to the group method.
The captured substrings are numbered from 1 (unlike list indices, which start at 0):

In [24]: result.group(1)
Out[24]: 'Charlie Cyan'

In [25]: result.group(2)
Out[25]: 'demo1@deitel.com'


8.13 INTRO TO DATA SCIENCE: PANDAS, REGULAR EXPRESSIONS AND DATA MUNGING


Data does not always come in forms ready for analysis. It could, for example, be in the
wrong format, incorrect or even missing. Industry experience has shown that data
scientists can spend as much as 75% of their time preparing data before they begin their
studies. Preparing data for analysis is called data munging or data wrangling.
These terms are synonyms. From this point forward, we'll say data munging.


Two of the most important steps in data munging are data cleaning and transforming 
data into the optimal formats for your database systems and analytics software. Some 


common data cleaning examples are: 


• deleting observations with missing values,
• substituting reasonable values for missing values,
• deleting observations with bad values,
• substituting reasonable values for bad values,
• tossing outliers (although sometimes you'll want to keep them),
• duplicate elimination (although sometimes duplicates are valid),
• dealing with inconsistent data,
• and more.


You're probably already thinking that data cleaning is a difficult and messy process 
where you could easily make bad decisions that would negatively impact your results. 
This is correct. When you get to the data science case studies in the later chapters, you'll 
see that data science is more of an empirical science, like medicine, and less of a 
theoretical science, like theoretical physics. Empirical sciences base their conclusions 
on observations and experience. For example, many medicines that effectively solve 
medical problems today were developed by observing the effects that early versions of 
these medicines had on lab animals and eventually humans, and gradually refining 
ingredients and dosages. The actions data scientists take can vary per project, be based 
on the quality and nature of the data and be affected by evolving organization and 


professional standards. 


Some common data transformations include: 


• removing unnecessary data and features (we'll say more about features in the data
science case studies),
• combining related features,
• sampling data to obtain a representative subset (we'll see in the data science case
studies that random sampling is particularly effective for this and we'll say why),
• standardizing data formats,
• grouping data,
• and more.


It’s always wise to hold onto your original data. We'll show simple examples of cleaning 


and transforming data in the context of Pandas Series and DataFrames. 


Cleaning Your Data 


Bad data values and missing values can significantly impact data analysis. Some data
scientists advise against any attempts to insert “reasonable values.” Instead, they
advocate clearly marking missing data and leaving it up to the data analytics package to
handle the issue. Others offer strong cautions. 4

4 This footnote was abstracted from a comment sent to us July 20, 2018 by one of the
book's reviewers, Dr. Alison Sanchez of the University of San Diego School of Business.
She commented: "Be cautious when mentioning 'substituting reasonable values' for
missing or bad values. A stern warning: 'Substituting' values that increase statistical
significance or give more 'reasonable' or 'better' results is not permitted. 'Substituting'
data should not turn into 'fudging' data. The first rule readers should learn is not to
eliminate or change values that contradict their hypotheses. 'Substituting reasonable
values' does not mean readers should feel free to change values to get the results they
want."


Let’s consider a hospital that records patients’ temperatures (and probably other vital
signs) four times per day. Assume that the data consists of a name and four float
values, such as

['Brown, Sue', 98.6, 98.4, 98.7, 0.0]

The preceding patient’s first three recorded temperatures are 98.6, 98.4 and 98.7. The
last temperature was missing and recorded as 0.0, perhaps because the sensor
malfunctioned. The average of the first three values is 98.57, which is close to normal.


However, if you calculate the average temperature including the missing value for 
which 0.0 was substituted, the average is only 73.93, clearly a questionable result. 
Certainly, doctors would not want to take drastic remedial action on this patient—it’s 


crucial to “get the data right.” 


One common way to clean the data is to substitute a reasonable value for the missing 
temperature, such as the average of the patient’s other readings. Had we done that 
above, then the patient’s average temperature would remain 98.57—a much more likely 


average temperature, based on the other readings. 
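The arithmetic behind these figures can be sketched in a few lines. The readings 98.6, 98.4 and 98.7 are assumed here because they are the values consistent with the stated average of 98.57, and treating 0.0 as the missing-value marker is the convention described in this example, not a general rule.

```python
readings = [98.6, 98.4, 98.7, 0.0]  # 0.0 marks the missing reading
MISSING = 0.0                       # assumed missing-value marker

# Average of only the readings that were actually recorded.
valid = [temp for temp in readings if temp != MISSING]
mean_valid = sum(valid) / len(valid)

# Average that naively includes the 0.0 placeholder.
mean_all = sum(readings) / len(readings)

# Substitute the mean of the valid readings for the missing one.
cleaned = [temp if temp != MISSING else mean_valid for temp in readings]
mean_cleaned = sum(cleaned) / len(cleaned)

print(f'{mean_valid:.2f}')    # 98.57
print(mean_all)               # roughly 73.9, clearly not a real temperature
print(f'{mean_cleaned:.2f}')  # 98.57
```

Substituting the mean leaves the average unchanged, which is why it is a common (though not universally endorsed) choice.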


Data Validation 


Let’s begin by creating a Series of five-digit ZIP Codes from a dictionary of
city-name/five-digit-ZIP-Code key-value pairs. We intentionally entered an invalid ZIP
Code for Miami:

In [1]: import pandas as pd

In [2]: zips = pd.Series({'Boston': '02215', 'Miami': '3310'})

In [3]: zips
Out[3]:
Boston    02215
Miami      3310
dtype: object


Though zips looks like a two-dimensional array, it’s actually one-dimensional. The 
“second column” represents the Series’ ZIP Code values (from the dictionary’s 


values), and the “first column” represents their indices (from the dictionary’s keys). 


We can use regular expressions with Pandas to validate data. The str attribute of a
Series provides string-processing and various regular expression methods. Let’s use
the str attribute’s match method to check whether each ZIP Code is valid:

In [4]: zips.str.match(r'\d{5}')
Out[4]:
Boston     True
Miami     False
dtype: bool


Method match applies the regular expression \d{5} to each Series element to
check that it consists of five digits. Like re.match, it anchors the pattern only at the
start of each value, so a longer value that merely begins with five digits would also
pass. You do not need to loop explicitly through all the ZIP Codes; match does this for
you. This is another example of functional-style programming with internal rather than
external iteration. The method returns a new Series containing True for each valid
element. In this case, the ZIP Code for Miami did not match, so its element is False.
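One caveat is worth a sketch. The pandas documentation describes str.match as being based on re.match, which anchors the pattern only at the beginning of the value. The plain-Python example below, using a hypothetical six-digit value, shows why that matters; recent pandas versions also offer a str.fullmatch method for whole-value checks, which you may want to confirm for your installed version.

```python
import re

candidates = ['02215', '3310', '022155']  # the last value is hypothetical

# re.match succeeds if the value merely starts with five digits.
prefix_ok = [bool(re.match(r'\d{5}', zip_code)) for zip_code in candidates]

# re.fullmatch requires the entire value to be exactly five digits.
whole_ok = [bool(re.fullmatch(r'\d{5}', zip_code)) for zip_code in candidates]

print(prefix_ok)  # [True, False, True]
print(whole_ok)   # [True, False, False]
```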


There are several ways to deal with invalid data. One is to catch it at its source and 
interact with the source to correct the value. That’s not always possible. For example, 
the data could be coming from high-speed sensors in the Internet of Things. In that 
case, we would not be able to correct it at the source, so we could apply data cleaning 
techniques. In the case of the bad Miami ZIP Code of 3310, we might look for Miami 
ZIP Codes beginning with 3310. There are two—33101 and 33109—and we could pick 


one of those. 


Sometimes, rather than matching an entire value to a pattern, you'll want to know
whether a value contains a substring that matches the pattern. In this case, use method
contains instead of match. Let’s create a Series of strings, each containing a U.S.
city, state and ZIP Code, then determine whether each string contains a substring
matching the pattern ' [A-Z]{2} ' (a space, followed by two uppercase letters,
followed by a space):

In [5]: cities = pd.Series(['Boston, MA 02215', 'Miami, FL 33101'])

In [6]: cities
Out[6]:
0    Boston, MA 02215
1     Miami, FL 33101
dtype: object

In [7]: cities.str.contains(r' [A-Z]{2} ')
Out[7]:
0    True
1    True
dtype: bool

In [8]: cities.str.match(r' [A-Z]{2} ')
Out[8]:
0    False
1    False
dtype: bool


We did not specify the index values, so the Series uses zero-based indexes by default
(snippet [6]). Snippet [7] uses contains to show that both Series elements
contain substrings that match ' [A-Z]{2} '. Snippet [8] uses match to show that
neither element’s value matches that pattern, because match looks only at the
beginning of each value and neither value begins with a space followed by two
uppercase letters.


Reformatting Your Data 


We've discussed data cleaning. Now let’s consider munging data into a different format. 
As a simple example, assume that an application requires U.S. phone numbers in the 
format ###-###-####, with hyphens separating each group of digits. The phone 
numbers have been provided to us as 10-digit strings without hyphens. Let’s create the 
DataFrame: 


In [9]: contacts = [['Mike Green', 'demo1@deitel.com', '5555555555'],
   ...:             ['Sue Brown', 'demo2@deitel.com', '5555551234']]

In [10]: contactsdf = pd.DataFrame(contacts,
    ...:                           columns=['Name', 'Email', 'Phone'])

In [11]: contactsdf
Out[11]:
         Name             Email       Phone
0  Mike Green  demo1@deitel.com  5555555555
1   Sue Brown  demo2@deitel.com  5555551234





In this DataFrame, we specified column indices via the columns keyword argument 
but did not specify row indices, so the rows are indexed from 0. Also, the output shows 
the column values right aligned by default. This differs from Python formatting in 
which numbers in a field are right aligned by default but non-numeric values are left 


aligned by default. 


Now, let’s munge the data with a little more functional-style programming. We can 
map the phone numbers to the proper format by calling the Series method map on 
the DataFrame’s 'Phone' column. Method map’s argument is a function that receives 
a value and returns the mapped value. The function get_formatted_phone maps 10 
consecutive digits into the format ###-###-####: 


In [12]: import re

In [13]: def get_formatted_phone(value):
    ...:     result = re.fullmatch(r'(\d{3})(\d{3})(\d{4})', value)
    ...:     return '-'.join(result.groups()) if result else value





The regular expression in the block’s first statement matches only 10 consecutive digits. 
It captures substrings containing the first three digits, the next three digits and the last 


four digits. The return statement operates as follows: 


• If result is None, we simply return value unmodified. 

• Otherwise, we call result.groups() to get a tuple containing the captured 
substrings and pass that tuple to string method join to concatenate the elements, 
separating each from the next with '-' to form the mapped phone number. 
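
To see each step of that return statement in isolation, the following sketch expands the conditional expression into an explicit if statement. The function name format_phone_verbose and the sample inputs are illustrative assumptions, not part of the original example:

```python
import re

def format_phone_verbose(value):
    """Equivalent to the conditional-expression version, written step by step."""
    result = re.fullmatch(r'(\d{3})(\d{3})(\d{4})', value)

    if result is None:  # value is not exactly 10 consecutive digits
        return value

    parts = result.groups()  # tuple of the three captured substrings
    return '-'.join(parts)

print(format_phone_verbose('5555551234'))  # 555-555-1234
print(format_phone_verbose('555-1234'))    # 555-1234 (returned unmodified)
```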


Series method map returns a new Series containing the results of calling its 
function argument for each value in the column. Snippet [15] displays the result, 


including the column’s name and type: 


In [14]: formatted_phone = contactsdf['Phone'].map(get_formatted_phone)

In [15]: formatted_phone
Out[15]:
0    555-555-5555
1    555-555-1234
Name: Phone, dtype: object


Once you’ve confirmed that the data is in the correct format, you can update it in the 
original DataFrame by assigning the new Series to the 'Phone' column: 


In [16]: contactsdf['Phone'] = formatted_phone

In [17]: contactsdf
Out[17]:
         Name             Email         Phone
0  Mike Green  demo1@deitel.com  555-555-5555
1   Sue Brown  demo2@deitel.com  555-555-1234











We'll continue our pandas discussion in the next chapter’s Intro to Data Science 
section, and we'll use pandas in several later chapters. 


8.14 WRAP-UP 


In this chapter, we presented various string formatting and processing capabilities. You 





formatted data in f-strings and with the string method format. We showed the 
augmented assignments for concatenating and repeating strings. You used string 
methods to remove whitespace from the beginning and end of strings and to change 
their case. We discussed additional methods for splitting strings and for joining 


iterables of strings. We introduced various character-testing methods. 


We showed raw strings that treat backslashes (\) as literal characters rather than the 
beginning of escape sequences. These were particularly useful for defining regular 


expressions, which often contain many backslashes. 


Next, we introduced the powerful pattern-matching capabilities of regular expressions 
with functions from the re module. We used the fullmatch function to ensure that an 
entire string matched a pattern, which is useful for validating data. We showed how to 
use the sub function to search for and replace substrings. We used the split 
function to tokenize strings based on delimiters that match a regular expression 
pattern. Then we showed various ways to search for patterns in strings and to access 
the resulting matches. 


In the Intro to Data Science section, we introduced the synonyms data munging and 
data wrangling and showed a sample data munging operation, namely transforming 
data. We continued our discussion of pandas Series and DataFrames by using 
regular expressions to validate and munge data. 


In the next chapter, we'll continue using various string-processing capabilities as we 
introduce reading text from files and writing text to files. We’ll use the csv module for 
manipulating comma-separated value (CSV) files. We'll also introduce exception 
handling so we can process exceptions as they occur, rather than displaying a 


traceback. 


9. Files and Exceptions 


Objectives 

In this chapter, you'll: 

• Understand the notions of files and persistent data. 

• Read, write and update files. 

• Read and write CSV files, a common format for machine-learning datasets. 

• Serialize objects into the JSON data-interchange format, commonly used to transmit 
data over the Internet, and deserialize JSON into objects. 

• Use the with statement to ensure that resources are properly released, avoiding 
“resource leaks”. 

• Use the try statement to delimit code in which exceptions may occur and handle 
those exceptions with associated except clauses. 

• Use the try statement’s else clause to execute code when no exceptions occur in the 
try suite. 

• Use the try statement’s finally clause to execute code regardless of whether an 
exception occurs in the try. 

• raise exceptions to indicate runtime problems. 

• Understand the traceback of functions and methods that led to an exception. 

• Use pandas to load into a DataFrame and process the Titanic Disaster CSV dataset. 
9.1 INTRODUCTION 


Variables, lists, tuples, dictionaries, sets, arrays, pandas Series and pandas 
DataFrames offer only temporary data storage. The data is lost when a local variable 
“goes out of scope” or when the program terminates. Files provide long-term retention 
of typically large amounts of data, even after the program that created the data 
terminates, so data maintained in files is persistent. Computers store files on secondary 
storage devices, including solid-state drives, hard disks and more. In this chapter, we 


explain how Python programs create, update and process data files. 


We consider text files in several popular formats—plain text, JSON (JavaScript Object 
Notation) and CSV (comma-separated values). We’ll use JSON to serialize and 
deserialize objects to facilitate saving those objects to secondary storage and 
transmitting them over the Internet. Be sure to read this chapter’s Intro to Data Science 
section in which we'll use both the Python Standard Library’s csv module and pandas 
to load and manipulate CSV data. In particular, we'll look at the CSV version of the 
Titanic disaster dataset. We’ll use many popular datasets in upcoming data-science 
case-study chapters on natural language processing, data mining Twitter, IBM Watson, 


machine learning, deep learning and big data. 


As part of our continuing emphasis on Python security, we'll discuss the security 
vulnerabilities of serializing and deserializing data with the Python Standard Library’s 


pickle module. We recommend JSON serialization in preference to pickle. 


We also introduce exception handling. An exception indicates an execution-time 
problem. You’ve seen exceptions of types ZeroDivisionError, NameError, 
ValueError, StatisticsError, TypeError, IndexError, KeyError and 
RuntimeError. We'll show how to deal with exceptions as they occur by using try 
statements and associated except clauses to handle exceptions. We'll also discuss the 
try statement’s else and finally clauses. The features presented here help you 
write robust, fault-tolerant programs that can deal with problems and continue 
executing or terminate gracefully. 


Programs typically request and release resources (such as files) during program 
execution. Often, these are in limited supply or can be used only by one program at a 
time. We show how to guarantee that after a program uses a resource, it’s released for 
use by other programs, even if an exception has occurred. You'll use the with 
statement for this purpose. 


9.2 FILES 


Python views a text file as a sequence of characters and a binary file (for images, 
videos and more) as a sequence of bytes. As in lists and arrays, the first character in a 
text file and byte in a binary file is located at position 0, so in a file of n characters or 
bytes, the highest position number is n - 1. The diagram below shows a conceptual view 
of a file: 

0   1   2   3   4   5   6   7   8   9   ...   n-1   [end-of-file marker] 


For each file you open, Python creates a file object that you'll use to interact with the 
file. 


End of File 


Every operating system provides a mechanism to denote the end of a file. Some 
represent it with an end-of-file marker (as in the preceding figure), while others 
might maintain a count of the total characters or bytes in the file. Programming 


languages generally hide these operating-system details from you. 


Standard File Objects 


When a Python program begins execution, it creates three standard file objects: 


• sys.stdin: the standard input file object, 

• sys.stdout: the standard output file object, and 

• sys.stderr: the standard error file object. 


Though these are considered file objects, they do not read from or write to files by 
default. The input function implicitly uses sys.stdin to get user input from the 
keyboard. Function print implicitly outputs to sys.stdout, which appears in the 
command line. Python implicitly outputs program errors and tracebacks to 
sys.stderr, which also appears in the command line. You must import the sys 
module if you need to refer to these objects explicitly in your code, but this is rare. 
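
As a brief illustration of referring to these objects explicitly, the sketch below sends one message to each output stream. The capture step uses contextlib.redirect_stderr and io.StringIO from the standard library purely for demonstration, so the text written to sys.stderr can be inspected. The messages are illustrative:

```python
import contextlib
import io
import sys

# print writes to sys.stdout by default; file= sends output elsewhere
print('normal output')                  # goes to sys.stdout
print('error output', file=sys.stderr)  # goes to sys.stderr

# For demonstration, temporarily capture what is written to sys.stderr
captured = io.StringIO()
with contextlib.redirect_stderr(captured):
    print('captured message', file=sys.stderr)

print(repr(captured.getvalue()))  # 'captured message\n'
```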


9.3 TEXT-FILE PROCESSING 


In this section, we'll write a simple text file that might be used by an accounts- 
receivable system to track the money owed by a company’s clients. We'll then read that 
text file to confirm that it contains the data. For each client, we'll store the client’s 
account number, last name and account balance owed to the company. Together, these 
data fields represent a client record. Python imposes no structure on a file, so notions 
such as records do not exist natively in Python. Programmers must structure files to 
meet their applications’ requirements. We'll create and maintain this file in order by 
account number. In this sense, the account number may be thought of as a record 
key. For this chapter, we assume that you launch IPython from the ch09 examples 
folder. 


9.3.1 Writing to a Text File: Introducing the with Statement 


Let’s create an accounts.txt file and write five client records to the file. Generally, 
records in text files are stored one per line, so we end each record with a newline 


character: 


In [1]: with open('accounts.txt', mode='w') as accounts:
   ...:     accounts.write('100 Jones 24.98\n')
   ...:     accounts.write('200 Doe 345.67\n')
   ...:     accounts.write('300 White 0.00\n')
   ...:     accounts.write('400 Stone -42.16\n')
   ...:     accounts.write('500 Rich 224.62\n')


You can also write to a file with print (which automatically outputs a \n), as in 




print('100 Jones 24.98', file=accounts) 


The with Statement 


Many applications acquire resources, such as files, network connections, database 


connections and more. You should release resources as soon as they're no longer 
needed. This practice ensures that other applications can use the resources. Python’s 


with statement: 


• acquires a resource (in this case, the file object for accounts.txt) and assigns its 
corresponding object to a variable (accounts in this example), 

• allows the application to use the resource via that variable, and 

• calls the resource object’s close method to release the resource when program 
control reaches the end of the with statement’s suite. 
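
The release step can be observed through a file object's closed attribute. The following sketch assumes a scratch file in a temporary folder (the name demo.txt is illustrative) rather than accounts.txt. It also shows that the file is closed when the suite ends because of an exception:

```python
import os
import tempfile

# Assumption: a scratch file in a temporary folder stands in for accounts.txt
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, mode='w') as demo:
    demo.write('100 Jones 24.98\n')
    closed_inside = demo.closed  # False: the file is open inside the suite

closed_after = demo.closed       # True: the with statement closed the file

# The file is also closed if the suite raises an exception
try:
    with open(path, mode='r') as demo2:
        raise RuntimeError('simulated problem')
except RuntimeError:
    closed_after_error = demo2.closed  # True

print(closed_inside, closed_after, closed_after_error)  # False True True
```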


Built-In Function open 


The built-in open function opens the file accounts.txt and associates it with a file 
object. The mode argument specifies the file-open mode, indicating whether to open 
a file for reading from the file, for writing to the file or both. The mode 'w' opens the 
file for writing, creating the file if it does not exist. If you do not specify a path to the 
file, Python creates it in the current folder (ch09). Be careful: opening a file for writing 
deletes all the existing data in the file. By convention, the .txt file extension 
indicates a plain text file. 


Writing to the File 


The with statement assigns the object returned by open to the variable accounts in 
the as clause. In the with statement’s suite, we use the variable accounts to interact 
with the file. In this case, we call the file object’s write method five times to write five 
records to the file, each as a separate line of text ending in a newline. At the end of the 
with statement’s suite, the with statement implicitly calls the file object’s close 


method to close the file. 


Contents of accounts.txt File 


After executing the previous snippet, your ch09 directory contains the file 
accounts.txt with the following contents, which you can view by opening the file in 


a text editor: 


100 Jones 24.98 
200 Doe 345.67 
300 White 0.00 
400 Stone -42.16 
500 Rich 224.62 


In the next section, you'll read the file and display its contents. 


9.3.2 Reading Data from a Text File 


We just created the text file accounts.txt and wrote data to it. Now let’s read that 
data from the file sequentially from beginning to end. The following session reads 
records from the file accounts.txt and displays the contents of each record in 
columns with the Account and Name columns left aligned and the Balance column 


right aligned, so the decimal points align vertically: 


In [1]: with open('accounts.txt', mode='r') as accounts:
   ...:     print(f'{"Account":<10}{"Name":<10}{"Balance":>10}')
   ...:     for record in accounts:
   ...:         account, name, balance = record.split()
   ...:         print(f'{account:<10}{name:<10}{balance:>10}')
   ...:
Account   Name         Balance
100       Jones          24.98
200       Doe           345.67
300       White           0.00
400       Stone         -42.16
500       Rich          224.62


If the contents of a file should not be modified, open the file for reading only. This 
prevents the program from accidentally modifying the file. You open a file for reading 
by passing the 'r' file-open mode as function open’s second argument. If you do not 
specify the folder in which to store the file, open assumes the file is in the current 
folder. 


Iterating through a file object, as shown in the preceding for statement, reads one line 
at a time from the file and returns it as a string. For each record (that is, line) in the 
file, string method split returns tokens in the line as a list, which we unpack into the 
variables account, name and balance.¹ The last statement in the for statement’s 
suite displays these variables in columns using field widths. 


1 When splitting strings on spaces (the default), split automatically discards the 
newline character. 


File Method readlines 


The file object’s readlines method also can be used to read an entire text file. The 
method returns each line as a string in a list of strings. For small files, this works well, 
but iterating over the lines in a file object, as shown above, can be more efficient.² 
Calling readlines for a large file can be a time-consuming operation, which must 
complete before you can begin using the list of strings. Using the file object in a for 
statement enables your program to process each text line as it’s read. 


2 https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects. 
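
The two approaches can be compared side by side. This sketch assumes a small scratch file (accounts_demo.txt, an illustrative name) with three records. readlines builds the complete list first, whereas the for statement handles one line at a time:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'accounts_demo.txt')
with open(path, 'w') as f:
    f.write('100 Jones 24.98\n200 Doe 345.67\n300 White 0.00\n')

# readlines loads every line into a list before any processing can start
with open(path, 'r') as f:
    all_lines = f.readlines()

# Iterating over the file object yields one line at a time
total = 0.0
with open(path, 'r') as f:
    for line in f:
        account, name, balance = line.split()
        total += float(balance)

print(all_lines)
print(f'{total:.2f}')  # 370.65
```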





Seeking to a Specific File Position 


While reading through a file, the system maintains a file-position pointer 
representing the location of the next character to read. Sometimes it’s necessary to 
process a file sequentially from the beginning several times during a program’s 
execution. Each time, you must reposition the file-position pointer to the beginning of 
the file, which you can do either by closing and reopening the file, or by calling the file 


object’s seek method, as in 





file_object.seek(0) 


The latter approach is faster. 
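
As a minimal sketch of repositioning, the following reads a scratch file (seek_demo.txt, an illustrative name) twice within a single with statement, calling seek(0) between the two passes:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'seek_demo.txt')
with open(path, 'w') as f:
    f.write('100 Jones 24.98\n200 Doe 345.67\n')

with open(path, 'r') as f:
    first_pass = [line.split()[1] for line in f]   # reads to the end of the file
    at_end = f.read()                              # '' because nothing is left
    f.seek(0)                                      # back to the beginning
    second_pass = [line.split()[0] for line in f]  # reads the file again

print(first_pass)    # ['Jones', 'Doe']
print(repr(at_end))  # ''
print(second_pass)   # ['100', '200']
```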


9.4 UPDATING TEXT FILES 


Formatted data written to a text file cannot be modified without the risk of destroying 
other data. If the name 'White' needs to be changed to 'Williams' in 
accounts.txt, the old name cannot simply be overwritten. The original record for 


White is stored as 


300 White 0.00 


If you overwrite the name 'White' with the name 'Williams', the record becomes 


300 Williams00 


The new last name contains three more characters than the original one, so the 
characters beyond the second “i” in 'Williams' overwrite other characters in the 
line. The problem is that in the formatted input-output model, records and their fields 
can vary in size. For example, 7, 14, -117, 2074 and 27383 are all integers and are 
stored in the same number of “raw data” bytes internally (typically 4 or 8 bytes in 
today’s systems). However, when these integers are output as formatted text, they 
become different-sized fields. For example, 7 is one character, 14 is two characters and 
27383 is five characters. 
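
The damage described above can be reproduced on a scratch copy of the record. This sketch assumes ASCII-only content, so that character offsets and byte offsets coincide, and uses the 'r+' mode to write into the existing text without truncating it. The file name is illustrative:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'overwrite_demo.txt')
with open(path, 'w') as f:
    f.write('300 White 0.00\n')

# 'r+' opens for reading and writing without deleting the contents
with open(path, 'r+') as f:
    f.seek(4)            # position just after '300 ' (ASCII-only assumption)
    f.write('Williams')  # 8 characters written over 'White 0.'

with open(path, 'r') as f:
    damaged = f.read()

print(repr(damaged))  # '300 Williams00\n'
```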


To make the preceding name change, we can: 


• copy the records before 300 White 0.00 into a temporary file, 

• write the updated and correctly formatted record for account 300 to this file, 

• copy the records after 300 White 0.00 to the temporary file, 

• delete the old file and 

• rename the temporary file to use the original file’s name. 


This can be cumbersome because it requires processing every record in the file, even if 
you need to update only one record. Updating a file as described above is more efficient 
when an application needs to update many records in one pass of the file.³ 


3 In the chapter, Big Data: Hadoop, Spark, NoSQL and IoT, you’ll see that database 
systems solve this “update in place” problem efficiently. 


Updating accounts.txt 


Let’s use a with statement to update the accounts.txt file to change account 300’s 
name from 'White' to 'Williams' as described above: 


In [1]: accounts = open('accounts.txt', 'r')

In [2]: temp_file = open('temp_file.txt', 'w')

In [3]: with accounts, temp_file:
   ...:     for record in accounts:
   ...:         account, name, balance = record.split()
   ...:         if account != '300':
   ...:             temp_file.write(record)
   ...:         else:
   ...:             new_record = ' '.join([account, 'Williams', balance])
   ...:             temp_file.write(new_record + '\n')











For readability, we opened the file objects (snippets [1] and [2]), then specified their 
variable names in the first line of snippet [3]. This with statement manages two 
resource objects, specified in a comma-separated list after with. The for statement 
unpacks each record into account, name and balance. If the account is not '300', 
we write record (which contains a newline) to temp_file. Otherwise, we assemble 
the new record containing 'Williams' in place of 'White' and write it to the file. 
After snippet [3], temp_file.txt contains: 


100 Jones 24.98 
200 Doe 345.67 
300 Williams 0.00 
400 Stone -42.16 
500 Rich 224.62 


os Module File-Processing Functions 








At this point, we have the old accounts.txt file and the new temp_file.txt. To 
complete the update, let’s delete the old accounts.txt file, then rename 
temp_file.txt as accounts.txt. The os module⁴ provides functions for 
interacting with the operating system, including several that manipulate your system’s 
files and directories. Now that we've created the temporary file, let’s use the remove 
function⁵ to delete the original file: 


4 https://docs.python.org/3/library/os.html. 

5 Use remove with caution: it does not warn you that you’re permanently deleting the 
file. 


In [4]: import os

In [5]: os.remove('accounts.txt')


Next, let’s use the rename function to rename the temporary file as 
'accounts.txt': 


In [6]: os.rename('temp_file.txt', 'accounts.txt')
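
The whole update can also be packaged as a function. The sketch below is one possible arrangement rather than the approach used in the session above: it swaps the separate remove and rename calls for the os module's replace function, which overwrites the destination in a single call. The function name update_name, the '.tmp' suffix and the scratch data are all illustrative assumptions:

```python
import os
import tempfile

def update_name(path, account_to_change, new_name):
    """Rewrite the file at path, replacing the name in one record."""
    temp_path = path + '.tmp'  # hypothetical temporary-file name

    with open(path, 'r') as source, open(temp_path, 'w') as temp_file:
        for record in source:
            account, name, balance = record.split()
            if account == account_to_change:
                record = ' '.join([account, new_name, balance]) + '\n'
            temp_file.write(record)

    os.replace(temp_path, path)  # the temporary file takes over the original name

# Usage with a scratch copy of the data
demo_path = os.path.join(tempfile.mkdtemp(), 'accounts.txt')
with open(demo_path, 'w') as f:
    f.write('200 Doe 345.67\n300 White 0.00\n400 Stone -42.16\n')

update_name(demo_path, '300', 'Williams')

with open(demo_path) as f:
    updated = f.read()

print(updated)
```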


9.5 SERIALIZATION WITH JSON 


Many libraries we'll use to interact with cloud-based services, such as Twitter, IBM 
Watson and others, communicate with your applications via JSON objects. JSON 
(JavaScript Object Notation) is a text-based, human-and-computer-readable, 
data-interchange format used to represent objects as collections of name-value pairs. JSON 
can even represent objects of custom classes like those you'll build in the next chapter. 


JSON has become the preferred data format for transmitting objects across platforms. 
This is especially true for invoking cloud-based web services, which are functions and 
methods that you call over the Internet. You'll become proficient at working with JSON 
data. In the “Data Mining Twitter” chapter, you'll access JSON objects containing 
tweets and their metadata. In the “IBM Watson and Cognitive Computing” chapter, 
you’ll access data in the JSON responses returned by Watson services. In the “Big Data: 
Hadoop, Spark, NoSQL and IoT” chapter, we'll store JSON tweet objects that we obtain 
from Twitter in MongoDB, a popular NoSQL database. In that chapter, we'll also work 
with other web services that send and receive data as JSON objects. 


JSON Data Format 


JSON objects are similar to Python dictionaries. Each JSON object contains a 
comma-separated list of property names and values, in curly braces. For example, the following 
key-value pairs might represent a client record: 


{"account": 100, "name": "Jones", "balance": 24.98}


JSON also supports arrays which, like Python lists, are comma-separated values in 
square brackets. For example, the following is an acceptable JSON array of numbers: 


[100, 200, 300]


Values in JSON objects and arrays can be: 


• strings in double quotes (like "Jones"), 

• numbers (like 100 or 24.98), 

• JSON Boolean values (represented as true or false in JSON), 

• null (to represent no value, like None in Python), 

• arrays (like [100, 200, 300]), and 

• other JSON objects. 
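
To see how each kind of JSON value arrives in Python, the following sketch parses a small JSON text with the json module's loads function and prints the resulting Python type of every value. The property names and values in the sample text are illustrative:

```python
import json

text = ('{"name": "Jones", "balance": 24.98, "active": true, '
        '"notes": null, "ids": [100, 200, 300], "address": {"city": "Boston"}}')

record = json.loads(text)  # parse the JSON text into Python objects

# Display each key, the Python type of its value and the value itself
for key, value in record.items():
    print(f'{key:<10}{type(value).__name__:<10}{value!r}')
```

Under the json module's default conversions, JSON strings become str, numbers become int or float, true and false become True and False, null becomes None, arrays become lists and nested objects become dictionaries.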


Python Standard Library Module json 


The json module enables you to convert objects to JSON (JavaScript Object 
Notation) text format. This is known as serializing the data. Consider the following 
dictionary, which contains one key-value pair consisting of the key 'accounts' with 
its associated value being a list of dictionaries representing two accounts. Each account 
dictionary contains three key-value pairs for the account number, name and balance: 


In [1]: accounts_dict = {'accounts': [
   ...:     {'account': 100, 'name': 'Jones', 'balance': 24.98},
   ...:     {'account': 200, 'name': 'Doe', 'balance': 345.67}]}


Serializing an Object to JSON 


Let’s write that object in JSON format to a file: 


In [2]: import json

In [3]: with open('accounts.json', 'w') as accounts:
   ...:     json.dump(accounts_dict, accounts)


Snippet [3] opens the file accounts.json and uses the json module’s dump 
function to serialize the dictionary accounts_dict into the file. The resulting file 
contains the following text, which we reformatted slightly for readability: 


{"accounts":
  [{"account": 100, "name": "Jones", "balance": 24.98},
   {"account": 200, "name": "Doe", "balance": 345.67}]}


Note that JSON delimits strings with double-quote characters. 


Deserializing the JSON Text 


The json module’s load function reads the entire JSON contents of its file object 
argument and converts the JSON into a Python object. This is known as deserializing 
the data. Let’s reconstruct the original Python object from this JSON text: 


In [4]: with open('accounts.json', 'r') as accounts:
   ...:     accounts_json = json.load(accounts)


We can now interact with the loaded object. For example, we can display the dictionary: 


In [5]: accounts_json
Out[5]:
{'accounts': [{'account': 100, 'name': 'Jones', 'balance': 24.98},
  {'account': 200, 'name': 'Doe', 'balance': 345.67}]}


As you’d expect, you can access the dictionary’s contents. Let’s get the list of 
dictionaries associated with the 'accounts' key: 


In [6]: accounts_json['accounts']
Out[6]:
[{'account': 100, 'name': 'Jones', 'balance': 24.98},
 {'account': 200, 'name': 'Doe', 'balance': 345.67}]


Now, let’s get the individual account dictionaries: 


In [7]: accounts_json['accounts'][0]
Out[7]: {'account': 100, 'name': 'Jones', 'balance': 24.98}

In [8]: accounts_json['accounts'][1]
Out[8]: {'account': 200, 'name': 'Doe', 'balance': 345.67}


Though we did not do so here, you can modify the dictionary as well. For example, you 
could add accounts to or remove accounts from the list, then write the dictionary back 
into the JSON file. 
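
A minimal sketch of that load, modify and write-back cycle follows. It assumes a scratch accounts.json file in a temporary folder, and the added account 300 is illustrative data:

```python
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'accounts.json')
accounts_dict = {'accounts': [
    {'account': 100, 'name': 'Jones', 'balance': 24.98},
    {'account': 200, 'name': 'Doe', 'balance': 345.67}]}

with open(path, 'w') as accounts:
    json.dump(accounts_dict, accounts)

# Load the data, modify it in memory, then overwrite the file
with open(path, 'r') as accounts:
    data = json.load(accounts)

data['accounts'].append({'account': 300, 'name': 'White', 'balance': 0.0})
data['accounts'] = [a for a in data['accounts'] if a['account'] != 200]

with open(path, 'w') as accounts:
    json.dump(data, accounts)

with open(path, 'r') as accounts:
    reloaded = json.load(accounts)

print([a['account'] for a in reloaded['accounts']])  # [100, 300]
```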


Displaying the JSON Text 


The json module’s dumps function (dumps is short for “dump string”) returns a 
Python string representation of an object in JSON format. Using dumps with load, you 
can read the JSON from the file and display it in a nicely indented format—sometimes 
called “pretty printing” the JSON. When the dumps function call includes the indent 
keyword argument, the string contains newline characters and indentation for pretty 


printing—you also can use indent with the dump function when writing to a file: 


In [9]: with open('accounts.json', 'r') as accounts:
   ...:     print(json.dumps(json.load(accounts), indent=4))
   ...:
{
    "accounts": [
        {
            "account": 100,
            "name": "Jones",
            "balance": 24.98
        },
        {
            "account": 200,
            "name": "Doe",
            "balance": 345.67
        }
    ]
}


9.6 FOCUS ON SECURITY: PICKLE SERIALIZATION AND 
DESERIALIZATION 


The Python Standard Library’s pickle module can serialize objects into a 
Python-specific data format. Caution: The Python documentation provides the 
following warnings about pickle: 


• “Pickle files can be hacked. If you receive a raw pickle file over the network, don’t 
trust it! It could have malicious code in it, that would run arbitrary Python when 
you try to de-pickle it. However, if you are doing your own pickle writing and 
reading, you're safe (provided no one else has access to the pickle file, of course.)”⁶ 

• “Pickle is a protocol which allows the serialization of arbitrarily complex Python 
objects. As such, it is specific to Python and cannot be used to communicate with 
applications written in other languages. It is also insecure by default: deserializing 
pickle data coming from an untrusted source can execute arbitrary code, if the data 
was crafted by a skilled attacker.”⁷ 


6 https://wiki.python.org/moin/UsingPickle. 

7 https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files. 


We do not recommend using pickle, but it’s been used for many years, so you're likely 


to encounter it in legacy code—old code that’s often no longer supported. 


9.7 ADDITIONAL NOTES REGARDING FILES 


The following table summarizes the various file-open modes for text files, including the 
modes for reading and writing we’ve introduced. The writing and appending modes 
create the file if it does not exist. The reading modes raise a FileNotFoundError if 
the file does not exist. Each text-file mode has a corresponding binary-file mode 
specified with b, as in 'rb' or 'wb+'. You'd use these modes, for example, if you were 
reading or writing binary files, such as images, audio, video, compressed ZIP files and 


many other popular custom file formats. 


Mode    Description 

'r'     Open a text file for reading. This is the default if you do not specify 
        the file-open mode when you call open. 

'w'     Open a text file for writing. Existing file contents are deleted. 

'a'     Open a text file for appending at the end, creating the file if it does 
        not exist. New data is written at the end of the file. 

'r+'    Open a text file for reading and writing. 

'w+'    Open a text file for reading and writing. Existing file contents are deleted. 

'a+'    Open a text file for reading and appending at the end. New data is 
        written at the end of the file. If the file does not exist, it is created. 
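
The difference between the writing and appending modes is easy to check on a scratch file. This sketch (the file name modes_demo.txt is illustrative) writes, appends, then writes again, reading the contents back after each change:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(path, 'w') as f:  # creates the file
    f.write('first\n')

with open(path, 'a') as f:  # appends; existing contents are kept
    f.write('second\n')

with open(path) as f:       # the mode defaults to 'r'
    after_append = f.read()

with open(path, 'w') as f:  # existing contents are deleted
    f.write('replaced\n')

with open(path, 'r') as f:
    after_write = f.read()

print(repr(after_append))  # 'first\nsecond\n'
print(repr(after_write))   # 'replaced\n'
```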


Other File Object Methods 


Here are a few more useful file-object methods. 


• For a text file, the read method returns a string containing the number of 
characters specified by the method’s integer argument. For a binary file, the method 
returns the specified number of bytes. If no argument is specified, the method 
returns the entire contents of the file. 

• The readline method returns one line of text as a string, including the newline 
character if there is one. This method returns an empty string when it encounters 
the end of the file. 

• The writelines method receives a list of strings and writes its contents to a file. 
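
The following sketch exercises these three methods on a scratch file (methods_demo.txt, an illustrative name). Note that writelines does not add newline characters, so each string in the list supplies its own:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'methods_demo.txt')

# writelines writes the strings as-is, without adding newlines
with open(path, 'w') as f:
    f.writelines(['100 Jones 24.98\n', '200 Doe 345.67\n'])

with open(path, 'r') as f:
    first_three = f.read(3)      # '100'
    rest_of_line = f.readline()  # ' Jones 24.98\n'
    second_line = f.readline()   # '200 Doe 345.67\n'
    at_end = f.readline()        # '' signals the end of the file
    f.seek(0)
    everything = f.read()        # no argument: the entire file

print(repr(first_three), repr(rest_of_line), repr(at_end))
```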


The classes that Python uses to create file objects are defined in the Python Standard 
Library’s io module (https://docs.python.org/3/library/io.html). 


9.8 HANDLING EXCEPTIONS 
Various types of exceptions can occur when you work with files, including: 


• A FileNotFoundError occurs if you attempt to open a non-existent file for 
reading with the 'r' or 'r+' modes. 

• A PermissionError occurs if you attempt an operation for which you do not 
have permission. This might occur if you try to open a file that your account is not 
allowed to access or create a file in a folder where your account does not have 
permission to write, such as where your computer’s operating system is stored. 

• A ValueError (with the error message 'I/O operation on closed file.') 
occurs when you attempt to write to a file that has already been closed. 
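
Two of these exceptions are easy to trigger deliberately. The sketch below uses the try statement, which Section 9.8.2 explains, only to capture each exception so its type and message can be displayed. The file names are illustrative, and the PermissionError case is omitted because it depends on the system's configuration:

```python
import os
import tempfile

folder = tempfile.mkdtemp()

# Opening a nonexistent file for reading raises FileNotFoundError
try:
    open(os.path.join(folder, 'no_such_file.txt'), 'r')
except FileNotFoundError as error:
    not_found = type(error).__name__

# Writing to a file object after it has been closed raises ValueError
with open(os.path.join(folder, 'closed_demo.txt'), 'w') as f:
    f.write('data\n')

try:
    f.write('more data\n')
except ValueError as error:
    closed_message = str(error)

print(not_found)       # FileNotFoundError
print(closed_message)  # I/O operation on closed file.
```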


9.8.1 Division by Zero and Invalid Input 


Let’s revisit two exceptions that you saw earlier in the book. 


Division By Zero 


Recall that attempting to divide by 0 results in a ZeroDivisionError: 





In [1]: 10 / 0
-------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-1-a243dfbf119d> in <module>()
----> 1 10 / 0

ZeroDivisionError: division by zero

In [2]:


In this case, the interpreter is said to raise an exception of type 
ZeroDivisionError. When an exception is raised in IPython, it: 


• terminates the snippet, 

• displays the exception’s traceback, then 

• shows the next In [] prompt so you can input the next snippet. 


If an exception occurs in a script, it terminates and IPython displays the traceback. 


Invalid Input 


Recall that the int function raises a ValueError if you attempt to convert to an 
integer a string (like 'hello') that does not represent a number: 


In [2]: value = int(input('Enter an integer: '))
Enter an integer: hello
-------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-b521605464d6> in <module>()
----> 1 value = int(input('Enter an integer: '))

ValueError: invalid literal for int() with base 10: 'hello'

In [3]:








9.8.2 try Statements 


Now let’s see how to handle these exceptions so that you can enable code to continue processing. Consider the following script and sample execution. Its loop attempts to read two integers from the user, then display the first number divided by the second. The script uses exception handling to catch and handle (i.e., deal with) any ZeroDivisionErrors and ValueErrors that arise, in this case allowing the user to re-enter the input.

 1 # dividebyzero.py
 2 """Simple exception handling example."""
 3
 4 while True:
 5     # attempt to convert and divide values
 6     try:
 7         number1 = int(input('Enter numerator: '))
 8         number2 = int(input('Enter denominator: '))
 9         result = number1 / number2
10     except ValueError:  # tried to convert non-numeric value to int
11         print('You must enter two integers\n')
12     except ZeroDivisionError:  # denominator was 0
13         print('Attempted to divide by zero\n')
14     else:  # executes only if no exceptions occur
15         print(f'{number1:.3f} / {number2:.3f} = {result:.3f}')
16         break  # terminate the loop





Enter numerator: 100
Enter denominator: 0
Attempted to divide by zero

Enter numerator: 100
Enter denominator: hello
You must enter two integers

Enter numerator: 100
Enter denominator: 7
100.000 / 7.000 = 14.286





try Clause 


Python uses try statements (like lines 6–16) to enable exception handling. The try statement’s try clause (lines 6–9) begins with keyword try, followed by a colon (:) and a suite of statements that might raise exceptions.


except Clause


A try clause may be followed by one or more except clauses (lines 10–11 and 12–13) that immediately follow the try clause’s suite. These also are known as exception handlers. Each except clause specifies the type of exception it handles. In this example, each exception handler just displays a message indicating the problem that occurred.


else Clause 


After the last except clause, an optional else clause (lines 14–16) specifies code that should execute only if the code in the try suite did not raise exceptions. If no exceptions occur in this example’s try suite, line 15 displays the division result and line 16 terminates the loop.


Flow of Control for a ZeroDivisionError


Now let’s consider this example’s flow of control, based on the first three lines of the sample output:

• First, the user enters 100 for the numerator in response to line 7 in the try suite.

• Next, the user enters 0 for the denominator in response to line 8 in the try suite.

• At this point, we have two integer values, so line 9 attempts to divide 100 by 0, causing Python to raise a ZeroDivisionError. The point in the program at which an exception occurs is often referred to as the raise point.


When an exception occurs in a try suite, it terminates immediately. If there are any 
except handlers following the try suite, program control transfers to the first one. If 
there are no except handlers, a process called stack unwinding occurs, which we 


discuss later in the chapter. 


In this example, there are except handlers, so the interpreter searches for the first one that matches the type of the raised exception:

• The except clause at lines 10–11 handles ValueErrors. This does not match the type ZeroDivisionError, so that except clause’s suite does not execute and program control transfers to the next except handler.

• The except clause at lines 12–13 handles ZeroDivisionErrors. This is a match, so that except clause’s suite executes, displaying "Attempted to divide by zero".

When an except clause successfully handles the exception, program execution resumes with the finally clause (if there is one), then with the next statement after the try statement. In this example, we reach the end of the loop, so execution resumes with the next loop iteration. Note that after an exception is handled, program control does not return to the raise point. Rather, control resumes after the try statement. We’ll discuss the finally clause shortly.


Flow of Control for a ValueError


Now let’s consider the flow of control, based on the next three lines of the sample output:

• First, the user enters 100 for the numerator in response to line 7 in the try suite.

• Next, the user enters hello for the denominator in response to line 8 in the try suite. The input is not a valid integer, so the int function raises a ValueError.

The exception terminates the try suite and program control transfers to the first except handler. In this case, the except clause at lines 10–11 is a match, so its suite executes, displaying "You must enter two integers". Then, program execution resumes with the next statement after the try statement. Again, that’s the end of the loop, so execution resumes with the next loop iteration.


Flow of Control for a Successful Division 


Now let’s consider the flow of control, based on the last three lines of the sample output:

• First, the user enters 100 for the numerator in response to line 7 in the try suite.

• Next, the user enters 7 for the denominator in response to line 8 in the try suite.

• At this point, we have two valid integer values and the denominator is not 0, so line 9 successfully divides 100 by 7.

When no exceptions occur in the try suite, program execution resumes with the else clause (if there is one); otherwise, program execution resumes with the next statement after the try statement. In this example’s else clause, we display the division result, then terminate the loop, and the program terminates.
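
The three flows of control described above can be traced without typing anything at the keyboard. The following is a minimal sketch rather than the script itself: the function name divide_text is hypothetical, and it applies the same try statement to two strings and returns the message instead of printing it inside a loop.

```python
def divide_text(numerator_text, denominator_text):
    """Return a message describing the outcome of one division attempt."""
    try:
        number1 = int(numerator_text)
        number2 = int(denominator_text)
        result = number1 / number2
    except ValueError:  # a string did not represent an integer
        return 'You must enter two integers'
    except ZeroDivisionError:  # denominator was 0
        return 'Attempted to divide by zero'
    else:  # runs only if the try suite raised no exception
        return f'{number1:.3f} / {number2:.3f} = {result:.3f}'


# the same three input pairs as the sample execution
for pair in [('100', '0'), ('100', 'hello'), ('100', '7')]:
    print(divide_text(*pair))
```

Each call takes exactly one of the three paths: the ZeroDivisionError handler, the ValueError handler, or the else clause.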


9.8.3 Catching Multiple Exceptions in One except Clause 


It’s relatively common for a try clause to be followed by several except clauses to handle various types of exceptions. If several except suites are identical, you can catch those exception types by specifying them as a tuple in a single except handler, as in:

except (type1, type2, …) as variable_name:

The as clause is optional. Typically, programs do not need to reference the caught exception object directly. If you do, you can use the variable in the as clause to reference the exception object in the except suite.
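
As a small illustration of the tuple form, here is a sketch in which one handler covers both exception types from the earlier division example. The function name parse_ratio is hypothetical and exists only for this example.

```python
def parse_ratio(numerator_text, denominator_text):
    """Return the quotient, or None if the inputs cannot be divided."""
    try:
        return int(numerator_text) / int(denominator_text)
    except (ValueError, ZeroDivisionError) as error:
        # error refers to the caught exception object
        print(f'{type(error).__name__}: {error}')
        return None


print(parse_ratio('10', '4'))
print(parse_ratio('10', '0'))
print(parse_ratio('ten', '4'))
```

Because both handlers would do the same thing, a single except clause keeps the code shorter, and the as variable still lets the message report which exception actually occurred.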


9.8.4 What Exceptions Does a Function or Method Raise? 


Exceptions may surface via statements in a try suite, via functions or methods called 
directly or indirectly from a try suite, or via the Python interpreter as it executes the 


code (for example, ZeroDivisionErrors). 


Before using any function or method, read its online API documentation, which specifies what exceptions are raised (if any) by the function or method and indicates reasons why such exceptions may occur. Next, read the online API documentation for each exception type to see potential reasons why such an exception occurs.


9.8.5 What Code Should Be Placed in a try Suite?


Place in a try suite a significant logical section of a program in which several 
statements can raise exceptions, rather than wrapping a separate try statement 
around every statement that raises an exception. However, for proper exception- 
handling granularity, each try statement should enclose a section of code small 
enough that, when an exception occurs, the specific context is known and the except 
handlers can process the exception properly. If many statements in a try suite raise 
the same exception types, multiple try statements may be required to determine each 


exception’s context. 


9.9 FINALLY CLAUSE 


Operating systems typically can prevent more than one program from manipulating a 
file at once. When a program finishes processing a file, the program should close it to 
release the resource so other programs can access it. Closing the file helps prevent a 


resource leak. 


The finally Clause of the try Statement 





A try statement may have a finally clause after any except clauses or the else clause. The finally clause is guaranteed to execute.⁸ In other languages that have finally, this makes the finally suite an ideal location to place resource-deallocation code for resources acquired in the corresponding try suite. In Python, we prefer the with statement for this purpose and place other kinds of “clean up” code in the finally suite.

⁸ The only reason a finally suite will not execute if program control enters the corresponding try suite is if the application terminates first, for example, when the process is ended abruptly by calling the os module’s _exit function. Calling the sys module’s exit function does not skip the finally suite, because it raises a SystemExit exception that unwinds the stack normally.


Example 





The following IPython session demonstrates that the finally clause always executes, 
regardless of whether an exception occurs in the corresponding try suite. First, let’s 


consider a try statement in which no exceptions occur in the try suite: 


In [1]: try:
   ...:     print('try suite with no exceptions raised')
   ...: except:
   ...:     print('this will not execute')
   ...: else:
   ...:     print('else executes because no exceptions in the try suite')
   ...: finally:
   ...:     print('finally always executes')
   ...:
try suite with no exceptions raised
else executes because no exceptions in the try suite
finally always executes

In [2]:














The preceding try suite displays a message but does not raise any exceptions. When program control successfully reaches the end of the try suite, the except clause is skipped, the else clause executes and the finally clause displays a message showing that it always executes. When the finally clause terminates, program control continues with the next statement after the try statement. In an IPython session, the next In [] prompt appears.

Now let’s consider a try statement in which an exception occurs in the try suite:


In [2]: try:
   ...:     print('try suite that raises an exception')
   ...:     int('hello')
   ...:     print('this will not execute')
   ...: except ValueError:
   ...:     print('a ValueError occurred')
   ...: else:
   ...:     print('else will not execute because an exception occurred')
   ...: finally:
   ...:     print('finally always executes')
   ...:
try suite that raises an exception
a ValueError occurred
finally always executes

In [3]:

















This try suite begins by displaying a message. The second statement attempts to 


convert the string 'hello' to an integer, which causes the int function to raise a 


ValueError. The try suite immediately terminates, skipping its last print 
statement. The except clause catches the ValueError exception and displays a 
message. The else clause does not execute because an exception occurred. Then, the 


finally clause displays a message showing that it always executes. When the 





finally clause terminates, program control continues with the next statement after 


the try statement. In an IPython session, the next In [] prompt appears. 
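
The two sessions above can be condensed into one sketch that records which clauses run. The function name run and the list executed are hypothetical names chosen for this illustration.

```python
def run(text):
    """Return the names of the clauses that executed, in order."""
    executed = []
    try:
        executed.append('try')
        int(text)  # raises ValueError if text is not an integer
    except ValueError:
        executed.append('except')
    else:
        executed.append('else')
    finally:
        executed.append('finally')  # runs on both paths
    return executed


print(run('7'))
print(run('hello'))
```

In both calls the last entry is 'finally'. Only the middle entry changes, depending on whether the try suite raised an exception.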


Combining with Statements and try…except Statements


Most resources that require explicit release, such as files, network connections and database connections, have potential exceptions associated with processing those resources. For example, a program that processes a file might raise IOErrors. For this reason, robust file-processing code normally appears in a try suite containing a with statement to guarantee that the resource gets released. The code is in a try suite, so you can catch in except handlers any exceptions that occur, and you do not need a finally clause because the with statement handles resource deallocation.

To demonstrate this, first let’s assume you’re asking the user to supply the name of a file and they provide that name incorrectly, such as gradez.txt rather than grades.txt, the file we created earlier. In this case, the open call raises a FileNotFoundError by attempting to open a non-existent file:


In [3]: open('gradez.txt')
-------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-3-b7f41b2d5969> in <module>()
----> 1 open('gradez.txt')

FileNotFoundError: [Errno 2] No such file or directory: 'gradez.txt'











To catch exceptions like FileNotFoundError that occur when you try to open a file 


for reading, wrap the with statement in a try suite, as in: 


In [4]: try:
   ...:     with open('gradez.txt', 'r') as accounts:
   ...:         print(f'{"ID":<3}{"Name":<7}{"Grade"}')
   ...:         for record in accounts:
   ...:             student_id, name, grade = record.split()
   ...:             print(f'{student_id:<3}{name:<7}{grade}')
   ...: except FileNotFoundError:
   ...:     print('The file name you specified does not exist')
   ...:
The file name you specified does not exist
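
The same pattern can be packaged as a function so that callers receive either the data or a clear signal that the file was missing. This is a minimal sketch: the function name read_grades and the choice to return None for a missing file are assumptions made for illustration.

```python
def read_grades(path):
    """Return a list of (student_id, name, grade) tuples, or None if path is missing."""
    try:
        # the with statement closes the file even if split or the loop fails
        with open(path, 'r') as grades:
            return [tuple(record.split()) for record in grades]
    except FileNotFoundError:
        print('The file name you specified does not exist')
        return None


result = read_grades('gradez_does_not_exist.txt')
print(result)
```

No finally clause is needed here: the with statement releases the file, and the except handler deals only with the missing-file case.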


9.10 EXPLICITLY RAISING AN EXCEPTION 


You’ve seen various exceptions raised by your Python code. Sometimes you might need 
to write functions that raise exceptions to inform callers of errors that occur. The 
raise statement explicitly raises an exception. The simplest form of the raise 


statement is 
raise ExceptionClassName 


The raise statement creates an object of the specified exception class. Optionally, the 
exception class name may be followed by parentheses containing arguments to 
initialize the exception object—typically to provide a custom error message string. Code 
that raises an exception first should release any resources acquired before the exception 


occurred. In the next section, we'll show an example of raising an exception. 
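
As a brief illustration of the form with an argument, the following sketch raises a built-in ValueError with a custom message. The function name average and the message text are hypothetical.

```python
def average(grades):
    """Return the mean of a non-empty sequence of numbers."""
    if len(grades) == 0:
        # inform the caller of the problem instead of dividing by zero
        raise ValueError('grades must contain at least one value')
    return sum(grades) / len(grades)


try:
    average([])
except ValueError as error:
    message = str(error)
    print(message)
```

The caller decides how to respond. Here the handler simply displays the message that was passed to the exception.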


In most cases, when you need to raise an exception, it’s recommended that you use one of Python’s many built-in exception types⁹ listed at:

https://docs.python.org/3/library/exceptions.html

⁹ You may be tempted to create custom exception classes that are specific to your application. We’ll say more about custom exceptions in the next chapter.


9.11 (OPTIONAL) STACK UNWINDING AND 
TRACEBACKS 


Each exception object stores information indicating the precise series of function calls that led to the exception. This is helpful when debugging your code. Consider the following function definitions, in which function1 calls function2 and function2 raises an Exception:


In [1]: def function1():
   ...:     function2()
   ...:

In [2]: def function2():
   ...:     raise Exception('An exception occurred')
   ...:





Calling function1 results in the following traceback. The arrows (---->) mark the lines of code that led to the exception:

In [3]: function1()
-------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-3-c0b3cafe2087> in <module>()
----> 1 function1()

<ipython-input-1-a9f4faeeeb0c> in function1()
      1 def function1():
----> 2     function2()
      3

<ipython-input-2-c65e19d6b45b> in function2()
      1 def function2():
----> 2     raise Exception('An exception occurred')

Exception: An exception occurred











Traceback Details 


The traceback shows the type of exception that occurred (Exception) followed by the 
complete function call stack that led to the raise point. The stack’s bottom function call 
is listed first and the top is last, so the interpreter displays the following text as a 


reminder: 


Traceback (most recent call last) 


In this traceback, the following text indicates the bottom of the function-call stack, which is the function1 call in snippet [3] (indicated by ipython-input-3):

<ipython-input-3-c0b3cafe2087> in <module>()
----> 1 function1()

Next, we see that function1 called function2 from line 2 in snippet [1]:

<ipython-input-1-a9f4faeeeb0c> in function1()
      1 def function1():
----> 2     function2()
      3

Finally, we see the raise point. In this case, line 2 in snippet [2] raised the exception:

<ipython-input-2-c65e19d6b45b> in function2()
      1 def function2():
----> 2     raise Exception('An exception occurred')





Stack Unwinding 


In our previous exception-handling examples, the raise point occurred in a try suite, 
and the exception was handled in one of the try statement’s corresponding except 
handlers. When an exception is not caught in a given function, stack unwinding 


occurs. Let’s consider stack unwinding in the context of this example: 


• In function2, the raise statement raises an exception. This is not in a try suite, so function2 terminates, its stack frame is removed from the function-call stack, and control returns to the statement in function1 that called function2.

• In function1, the statement that called function2 is not in a try suite, so function1 terminates, its stack frame is removed from the function-call stack, and control returns to the statement that called function1, which is snippet [3] in the IPython session.

• The call in snippet [3] is not in a try suite, so that function call terminates. Because the exception was not caught (known as an uncaught exception), IPython displays the traceback, then awaits your next input. If this occurred in a typical script, the script would terminate.¹⁰

¹⁰ In more advanced applications that use threads, an uncaught exception terminates only the thread in which the exception occurs, not necessarily the entire application.
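
The call chain that a traceback displays can also be inspected in code. The sketch below reuses function1 and function2 and reads the frames with the standard library's traceback.extract_tb. The variable names frames and call_chain are hypothetical.

```python
import traceback


def function1():
    function2()


def function2():
    raise Exception('An exception occurred')


try:
    function1()
except Exception as error:
    # frames are ordered from the bottom of the call stack to the raise point
    frames = traceback.extract_tb(error.__traceback__)
    call_chain = [frame.name for frame in frames]
    print(call_chain)
```

The last two names are function1 and function2, matching the order in which the traceback lists them. The first entry names the code that contained the try statement.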


Tip for Reading Tracebacks 


You'll often call functions and methods that belong to libraries of code you did not 


write. Sometimes those functions and methods raise exceptions. When reading a 
traceback, start from the end of the traceback and read the error message first. Then, 
read upward through the traceback, looking for the first line that indicates code you 
wrote in your program. Typically, this is the location in your code that led to the 


exception. 


Exceptions in finally Suites 





Raising an exception in a finally suite can lead to subtle, hard-to-find problems. If an exception occurs and is not processed by the time the finally suite executes, stack unwinding occurs. If the finally suite raises a new exception that the suite does not catch, the first exception is lost, and the new exception is passed to the next enclosing try statement. For this reason, a finally suite should always enclose in a try statement any code that may raise an exception, so that the exceptions will be processed within that suite.
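
The following sketch shows the problem in miniature. The function name risky_cleanup is hypothetical. Note that in Python 3 the first exception is lost only in the sense that it no longer propagates: the interpreter still attaches it to the new exception as the __context__ attribute, which the sketch displays.

```python
def risky_cleanup():
    try:
        raise ValueError('original problem')
    finally:
        # this new exception replaces the ValueError that was propagating
        raise RuntimeError('cleanup failed')


try:
    risky_cleanup()
except Exception as error:
    caught = error
    print(type(caught).__name__, '-', caught)
    print('context:', repr(caught.__context__))
```

The handler receives the RuntimeError, not the ValueError, so code that expected to handle the original problem never sees it directly.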


9.12 INTRO TO DATA SCIENCE: WORKING WITH CSV 
FILES 


Throughout this book, you'll work with many datasets as we present data-science 
concepts. CSV (comma-separated values) is a particularly popular file format. In 
this section, we'll demonstrate CSV file processing with a Python Standard Library 


module and pandas. 


9.12.1 Python Standard Library Module csv 


The csv module¹¹ provides functions for working with CSV files. Many other Python libraries also have built-in CSV support.

¹¹ https://docs.python.org/3/library/csv.html.


Writing to a CSV File


Let’s create an accounts.csv file using CSV format. The csv module’s documentation recommends opening CSV files with the additional keyword argument newline='' to ensure that newlines are processed properly:


In [1]: import csv

In [2]: with open('accounts.csv', mode='w', newline='') as accounts:
   ...:     writer = csv.writer(accounts)
   ...:     writer.writerow([100, 'Jones', 24.98])
   ...:     writer.writerow([200, 'Doe', 345.67])
   ...:     writer.writerow([300, 'White', 0.00])
   ...:     writer.writerow([400, 'Stone', -42.16])
   ...:     writer.writerow([500, 'Rich', 224.62])
   ...:





The .csv file extension indicates a CSV-format file. The csv module’s writer function returns an object that writes CSV data to the specified file object. Each call to the writer’s writerow method receives an iterable to store in the file. Here we’re using lists. By default, writerow delimits values with commas, but you can specify custom delimiters.¹² After the preceding snippet, accounts.csv contains:

100,Jones,24.98
200,Doe,345.67
300,White,0.0
400,Stone,-42.16
500,Rich,224.62

¹² https://docs.python.org/3/library/csv.html#csv-fmt-params.


CSV files generally do not contain spaces after commas, but some people use them to 
enhance readability. The writerow calls above can be replaced with one writerows 


call that outputs a comma-separated list of iterables representing the records. 
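
As a sketch of the writerows alternative, the following writes several records with one call. To stay self-contained it writes to an in-memory io.StringIO object rather than to accounts.csv. The names records, buffer and text are hypothetical.

```python
import csv
import io

records = [[100, 'Jones', 24.98],
           [200, 'Doe', 345.67],
           [300, 'White', 0.00]]

buffer = io.StringIO(newline='')  # stands in for a file opened with newline=''
csv.writer(buffer).writerows(records)  # one call writes every record
text = buffer.getvalue()
print(text)
```

Each inner list becomes one line of output. Note that the float 0.00 is written as 0.0, because the writer stores the number's standard string form.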


If you write data that contains commas within a given string, writerow encloses that string in double quotes. For example, consider the following Python list:

[100, 'Jones, Sue', 24.98]

The single-quoted string 'Jones, Sue' contains a comma separating the last name and first name. In this case, writerow would output the record as

100,"Jones, Sue",24.98

The quotes around "Jones, Sue" indicate that this is a single value. Programs reading this from a CSV file would break the record into three pieces: 100, 'Jones, Sue' and 24.98.
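
A short round trip confirms this quoting behavior. This sketch again uses an in-memory io.StringIO object in place of a file, and the names buffer, line and fields are hypothetical.

```python
import csv
import io

buffer = io.StringIO(newline='')
csv.writer(buffer).writerow([100, 'Jones, Sue', 24.98])
line = buffer.getvalue()  # the text that would be stored in the file
print(line)

buffer.seek(0)  # rewind so the reader starts at the beginning
fields = next(csv.reader(buffer))
print(fields)
```

The reader returns three fields, all as strings. Converting '100' and '24.98' back to numbers is left to the program that reads the file.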


Reading from a CSV File 


Now let’s read the CSV data from the file. The following snippet reads records from the 
file accounts.csv and displays the contents of each record, producing the same 


output we showed earlier: 


In [3]: with open('accounts.csv', 'r', newline='') as accounts:
   ...:     print(f'{"Account":<10}{"Name":<10}{"Balance":>10}')
   ...:     reader = csv.reader(accounts)
   ...:     for record in reader:
   ...:         account, name, balance = record
   ...:         print(f'{account:<10}{name:<10}{balance:>10}')
   ...:
Account   Name         Balance
100       Jones          24.98
200       Doe           345.67
300       White            0.0
400       Stone         -42.16
500       Rich          224.62


The csv module’s reader function returns an object that reads CSV-format data 
from the specified file object. Just as you can iterate through a file object, you can 
iterate through the reader object one record of comma-delimited values at a time. The 


preceding for statement returns each record as a list of values, which we unpack into 





the variables account, name and balance, then display. 


Caution: Commas in CSV Data Fields 


Be careful when working with strings containing embedded commas, such as the name 'Jones, Sue'. If you accidentally enter this as the two strings 'Jones' and 'Sue', then writerow would, of course, create a CSV record with four fields, not three. Programs that read CSV files typically expect every record to have the same number of fields; otherwise, problems occur. For example, consider the following two lists:

[100, 'Jones', 'Sue', 24.98]
[200, 'Doe', 345.67]


The first list contains four values and the second contains only three. If these two 
records were written into the CSV file, then read into a program using the previous 
snippet, the following statement would fail when we attempt to unpack the four-field 


record into only three variables: 


account, name, balance = record 


Caution: Missing Commas and Extra Commas in CSV Files 


Be careful when preparing and processing CSV files. For example, suppose your file is 


composed of records, each with four comma-separated int values, such as: 
100,85,77,9 

If you accidentally omit one of these commas, as in: 
100,8577,9 


then the record has only three fields, one with the invalid value 8577. 


If you put two adjacent commas where only one is expected, as in: 
100,85,,77,9 


then you have five fields rather than four, and one of the fields erroneously would be 
empty. Each of these comma-related errors could confuse programs trying to process 


the record. 
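
One way to guard against these problems is to check each record's field count before processing it. The following is a sketch under that assumption, not a complete validator. The function name find_bad_records and the sample lines are hypothetical.

```python
import csv


def find_bad_records(lines, expected_fields):
    """Return (line_number, fields) pairs for records with the wrong field count."""
    bad = []
    for line_number, fields in enumerate(csv.reader(lines), start=1):
        if len(fields) != expected_fields:
            bad.append((line_number, fields))
    return bad


sample = ['100,85,77,9', '100,8577,9', '100,85,,77,9']
problems = find_bad_records(sample, 4)
print(problems)
```

A count check catches both the missing comma and the extra comma. It cannot catch a record whose field count happens to be right but whose values are wrong, so value checks may still be needed.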


9.12.2 Reading CSV Files into Pandas DataFrames 


In the Intro to Data Science sections in the previous two chapters, we introduced many 
pandas fundamentals. Here, we demonstrate pandas’ ability to load files in CSV format, 


then perform some basic data-analysis tasks. 


Datasets 


In the data-science case studies, we'll use various free and open datasets to 
demonstrate machine learning and natural language processing concepts. There’s an 
enormous variety of free datasets available online. The popular Rdatasets repository 
provides links to over 1100 free datasets in comma-separated values (CSV) format. 
These were originally provided with the R programming language for people learning 
about and developing statistical software, though they are not specific to R. They are 


now available on GitHub at: 


https://vincentarelbundock.github.io/Rdatasets/datasets.html


This repository is so popular that there’s a pydataset module specifically for 
accessing Rdatasets. For instructions on installing pydataset and accessing datasets 


with it, see: 


https://github.com/iamaziz/PyDataset


Another large source of datasets is: 


https://github.com/awesomedata/awesome-public-datasets


A commonly used machine-learning dataset for beginners is the Titanic disaster 
dataset, which lists all the passengers and whether they survived when the ship 
Titanic struck an iceberg and sank April 14-15, 1912. We'll use it here to show how to 
load a dataset, view some of its data and display some descriptive statistics. We’ll dig 


deeper into a variety of popular datasets in the data-science chapters later in the book. 


Working with Locally Stored CSV Files 


You can load a CSV dataset into a DataFrame with the pandas function read_csv. The following loads and displays the CSV file accounts.csv that you created earlier in this chapter:

In [1]: import pandas as pd

In [2]: df = pd.read_csv('accounts.csv',
   ...:                  names=['account', 'name', 'balance'])
   ...:

In [3]: df
Out[3]:
   account   name  balance
0      100  Jones    24.98
1      200    Doe   345.67
2      300  White     0.00
3      400  Stone   -42.16
4      500   Rich   224.62


The names argument specifies the DataFrame’s column names. Without this argument, read_csv assumes that the CSV file’s first row is a comma-delimited list of column names.

To save a DataFrame to a file using CSV format, call DataFrame method to_csv:

In [4]: df.to_csv('accounts_from_dataframe.csv', index=False)

The index=False keyword argument indicates that the row names (0–4 at the left of the DataFrame’s output in snippet [3]) are not written to the file. The resulting file contains the column names as the first row:

account,name,balance
100,Jones,24.98
200,Doe,345.67
300,White,0.0
400,Stone,-42.16
500,Rich,224.62


9.12.3 Reading the Titanic Disaster Dataset 


The Titanic disaster dataset is one of the most popular machine-learning datasets. The 


dataset is available in many formats, including CSV. 


Loading the Titanic Dataset via a URL 


If you have a URL representing a CSV dataset, you can load it into a DataFrame with read_csv. Let’s load the Titanic Disaster dataset directly from GitHub:

In [1]: import pandas as pd

In [2]: titanic = pd.read_csv('https://vincentarelbundock.github.io/' +
   ...:     'Rdatasets/csv/carData/TitanicSurvival.csv')
   ...:














Viewing Some of the Rows in the Titanic Dataset 


This dataset contains over 1300 rows, each representing one passenger. According to Wikipedia, there were approximately 1317 passengers and 815 of them died.¹³ For large datasets, displaying the DataFrame shows only the first 30 rows, followed by “...” and the last 30 rows. To save space, let’s view the first five and last five rows with DataFrame methods head and tail. Both methods return five rows by default, but you can specify the number of rows to display as an argument:

¹³ https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic.

In [3]: pd.set_option('display.precision', 2)  # format for floating-point values

In [4]: titanic.head()
Out[4]:
                        Unnamed: 0 survived     sex    age passengerClass
0    Allen, Miss. Elisabeth Walton      yes  female  29.00            1st
1   Allison, Master. Hudson Trevor      yes    male   0.92            1st
2     Allison, Miss. Helen Loraine       no  female   2.00            1st
3  Allison, Mr. Hudson Joshua Crei       no    male  30.00            1st
4  Allison, Mrs. Hudson J C (Bessi       no  female  25.00            1st

In [5]: titanic.tail()
Out[5]:
                     Unnamed: 0 survived     sex    age passengerClass
1304       Zabour, Miss. Hileni       no  female  14.50            3rd
1305      Zabour, Miss. Thamine       no  female    NaN            3rd
1306  Zakarian, Mr. Mapriededer       no    male  26.50            3rd
1307        Zakarian, Mr. Ortin       no    male  27.00            3rd
1308         Zimmerman, Mr. Leo       no    male  29.00            3rd










Note that pandas adjusts each column’s width, based on the widest value in the column 


or based on the column name, whichever is wider. Also, note the value in the age 


column of row 1305 is NaN (not a number), indicating a missing value in the dataset. 


Customizing the Column Names 


The first column in this dataset has a strange name ('Unnamed: 0'). We can clean that up by setting the column names. Let’s change 'Unnamed: 0' to 'name' and let’s shorten 'passengerClass' to 'class':

In [6]: titanic.columns = ['name', 'survived', 'sex', 'age', 'class']

In [7]: titanic.head()
Out[7]:
                              name survived     sex    age class
0    Allen, Miss. Elisabeth Walton      yes  female  29.00   1st
1   Allison, Master. Hudson Trevor      yes    male   0.92   1st
2     Allison, Miss. Helen Loraine       no  female   2.00   1st
3  Allison, Mr. Hudson Joshua Crei       no    male  30.00   1st
4  Allison, Mrs. Hudson J C (Bessi       no  female  25.00   1st


9.12.4 Simple Data Analysis with the Titanic Disaster Dataset 


Now, you can use pandas to perform some simple analysis. For example, let’s look at 
some descriptive statistics. When you call describe on a DataFrame containing both 
numeric and non-numeric columns, describe calculates these statistics only for the 


numeric columns—in this case, just the age column: 


In [8]: titanic.describe()
Out[8]:
           age
count  1046.00
mean     29.88
std      14.41
min       0.17
25%      21.00
50%      28.00
75%      39.00
max      80.00


Note the discrepancy in the count (1046) vs. the dataset’s number of rows (1309; the last row’s index was 1308 when we called tail). Only 1046 (the count above) of the records contained an age value. The rest were missing and marked as NaN, as in row 1305. When performing calculations, pandas ignores missing data (NaN) by default. For the 1046 people with valid ages, the average (mean) age was 29.88 years old. The youngest passenger (min) was just over two months old (0.17 * 12 is 2.04), and the oldest (max) was 80. The median age was 28 (indicated by the 50% quartile). The 25% quartile is the median age in the first half of the passengers (sorted by age), and the 75% quartile is the median of the second half of passengers.
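
To see why the count can be smaller than the number of rows, consider this standard-library sketch of the same idea: drop the missing values first, then compute the statistics on what remains. It does not use pandas, and the ages list is a small made-up sample, not the Titanic data.

```python
import math
import statistics

# made-up sample for illustration; math.nan stands in for a missing age
ages = [29.0, 0.92, 2.0, math.nan, 30.0, 25.0]

valid = [age for age in ages if not math.isnan(age)]  # skip missing values
count = len(valid)
mean = statistics.mean(valid)
median = statistics.median(valid)

print(f'rows: {len(ages)}, count: {count}')
print(f'mean: {mean:.2f}, median: {median:.2f}')
```

Six rows yield a count of five, just as 1309 rows yielded a count of 1046 above.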


Let’s say you want to determine some statistics about people who survived. We can compare the survived column to 'yes' to get a new Series containing True/False values, then use describe to summarize the results:

In [9]: (titanic.survived == 'yes').describe()
Out[9]:
count      1309
unique        2
top       False
freq        809
Name: survived, dtype: object

For non-numeric data, describe displays different descriptive statistics:

• count is the total number of items in the result.

• unique is the number of unique values (2) in the result: True (survived) and False (died).

• top is the most frequently occurring value in the result.

• freq is the number of occurrences of the top value.


9.12.5 Passenger Age Histogram 


Visualization is a nice way to get to know your data. Pandas has many built-in visualization capabilities that are implemented with Matplotlib. To use them, first enable Matplotlib support in IPython:

In [10]: %matplotlib

A histogram visualizes the distribution of numerical data over a range of values. A DataFrame’s hist method automatically analyzes each numerical column’s data and produces a corresponding histogram. To view histograms of each numerical data column, call hist on your DataFrame:

In [11]: histogram = titanic.hist()

The Titanic dataset contains only one numerical data column, so the diagram shows one histogram for the age distribution. For datasets with multiple numerical columns, hist creates a separate histogram for each numerical column.

9.13 WRAP-UP


In this chapter, we introduced text-file processing and exception handling. Files are 
used to store data persistently. We discussed file objects and mentioned that Python 
views a file as a sequence of characters or bytes. We also mentioned the standard file 


objects that are automatically created for you when a Python program begins executing. 


We showed how to create, read, write and update text files. We considered several 
popular file formats—plain text, JSON (JavaScript Object Notation) and CSV (comma- 
separated values). We used the built-in open function and the with statement to open 
a file, write to or read from the file and automatically close the file to prevent resource 
leaks when the with statement terminates. We used the Python Standard Library’s 
json module to serialize objects into JSON format and store them in a file, load JSON 
objects from a file, deserialize them into Python objects and pretty-print a JSON object 
for readability. 


We discussed how exceptions indicate execution-time problems and listed the various 
exceptions you've already seen. We showed how to deal with exceptions by wrapping 
code in try statements that provide except clauses to handle specific types of 
exceptions that may occur in the try suite, making your programs more robust and 


fault-tolerant. 


We discussed the try statement’s finally clause for executing code if program flow
entered the corresponding try suite. You can use either the with statement or a try
statement’s finally clause for this purpose; we prefer the with statement.





In the Intro to Data Science section, we used both the Python Standard Library’s csv 
module and capabilities of the pandas library to load, manipulate and store CSV data. 
Finally, we loaded the Titanic disaster dataset into a pandas DataFrame, changed 
some column names for readability, displayed the head and tail of the dataset, and 
performed simple analysis of the data. In the next chapter, we'll discuss Python’s 


object-oriented programming capabilities. 




10. Object-Oriented Programming 


Objectives 

In this chapter, you'll: 

• Create custom classes and objects of those classes.
• Understand the benefits of crafting valuable classes.
• Control access to attributes.
• Appreciate the value of object orientation.
• Use Python special methods __repr__, __str__ and __format__ to get an object’s string representations.
• Use Python special methods to overload (redefine) operators to use them with objects of new classes.
• Inherit methods, properties and attributes from existing classes into new classes, then customize those classes.
• Understand the inheritance notions of base classes (superclasses) and derived classes (subclasses).
• Understand duck typing and polymorphism that enable “programming in the general.”
• Understand class object from which all classes inherit fundamental capabilities.
• Compare composition and inheritance.
• Build test cases into docstrings and run these tests with doctest.
• Understand namespaces and how they affect scope.
10.1 INTRODUCTION


Section 1.2 introduced the basic terminology and concepts of object-oriented programming.
Everything in Python is an object, so you’ve been using objects constantly throughout this
book. Just as houses are built from blueprints, objects are built from classes, one of the core
technologies of object-oriented programming. Building a new object from even a large class is
simple: you typically write one statement.


Crafting Valuable Classes 


You've already used lots of classes created by other people. In this chapter you'll create your 
own custom classes. You'll focus on “crafting valuable classes” that help you meet the 
requirements of the applications you'll build. You'll use object-oriented programming with its 
core technologies of classes, objects, inheritance and polymorphism. Software applications 
are becoming larger and more richly functional. Object-oriented programming makes it 
easier for you to design, implement, test, debug and update such edge-of-the-practice 
applications. Read Sections 10.1 through 10.9 for a code-intensive introduction to these
technologies. Most people can skip Sections 10.10 through 10.15, which provide additional
perspectives on these technologies and present additional related features.


Class Libraries and Object-Based Programming 


The vast majority of object-oriented programming you'll do in Python is object-based 
programming in which you primarily create and use objects of existing classes. You’ve been 
doing this throughout the book with built-in types like int, float, str, list, tuple, dict
and set, with Python Standard Library types like Decimal, and with NumPy arrays,
Matplotlib Figures and Axes, and pandas Series and DataFrames.


To take maximum advantage of Python you must familiarize yourself with lots of preexisting 
classes. Over the years, the Python open-source community has crafted an enormous number 
of valuable classes and packaged them into class libraries. This makes it easy for you to reuse 
existing classes rather than “reinventing the wheel.” Widely used open-source library classes 
are more likely to be thoroughly tested, bug free, performance tuned and portable across a 
wide range of devices, operating systems and Python versions. You'll find abundant Python 
libraries on the Internet at sites like GitHub, Bitbucket, SourceForge and more, most of them
easily installed with conda or pip. This is a key reason for Python’s popularity. The vast
majority of the classes you'll need are likely to be freely available in open-source libraries.


Creating Your Own Custom Classes 


Classes are new data types. Each Python Standard Library class and third-party library class
is a custom type built by someone else. In this chapter, you’ll develop application-specific
classes, like CommissionEmployee, Time, Card, DeckOfCards and more.


Most applications you'll build for your own use will commonly use either no custom classes 
or just a few. If you become part of a development team in industry, you may work on 
applications that contain hundreds, or even thousands, of classes. You can contribute your 
custom classes to the Python open-source community, but you are not obligated to do so. 
Organizations often have policies and procedures related to open-sourcing code. 


Inheritance 


Perhaps most exciting is the notion that new classes can be formed through inheritance and 
composition from classes in abundant class libraries. Eventually, software will be constructed 
predominantly from standardized, reusable components just as hardware is 
constructed from interchangeable parts today. This will help meet the challenges of 


developing ever more powerful software. 


When creating a new class, instead of writing all new code, you can designate that the new 
class is to be formed initially by inheriting the attributes (variables) and methods (the class 
version of functions) of a previously defined base class (also called a superclass). The new 
class is called a derived class (or subclass). After inheriting, you then customize the 
derived class to meet the specific needs of your application. To minimize the customization 
effort, you should always try to inherit from the base class that’s closest to your needs. To do 
that effectively, you should familiarize yourself with the class libraries that are geared to the 


kinds of applications you'll be building. 
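
As a minimal sketch of the base-class/derived-class relationship just described, consider the following. The class names Employee and Manager, their attributes and the earnings method are hypothetical, chosen only for illustration; they are not classes from this chapter's examples.

```python
class Employee:
    """Hypothetical base class (superclass)."""

    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

    def earnings(self):
        """Return the employee's earnings."""
        return self.salary


class Manager(Employee):
    """Hypothetical derived class (subclass) that customizes Employee."""

    def __init__(self, name, salary, bonus):
        super().__init__(name, salary)  # reuse the base-class initialization
        self.bonus = bonus

    def earnings(self):
        """Override earnings to add the bonus to the inherited calculation."""
        return super().earnings() + self.bonus


boss = Manager('Sue Jones', 1000, 250)
print(boss.name, boss.earnings())
print(isinstance(boss, Employee))  # a Manager "is an" Employee
```

Manager writes only the code that differs from Employee; the name and salary handling are inherited rather than rewritten.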


Polymorphism 


We explain and demonstrate polymorphism, which enables you to conveniently program 
“in the general” rather than “in the specific.” You simply send the same method call to objects
possibly of many different types. Each object responds by “doing the right thing.” So the same
method call takes on “many forms,” hence the term “poly-morphism.” We'll explain how to
implement polymorphism through inheritance and a Python feature called duck typing. We'll
explain both and show examples of each.
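
A small sketch of the duck-typing flavor of polymorphism may help here. The classes Duck and Robot and the function collect_sounds are hypothetical names introduced only for this illustration; the two classes are unrelated by inheritance and merely happen to provide the same method.

```python
class Duck:
    def speak(self):
        return 'Quack'


class Robot:
    def speak(self):
        return 'Beep'


def collect_sounds(things):
    """Send the same method call to every object, whatever its type."""
    return [thing.speak() for thing in things]


sounds = collect_sounds([Duck(), Robot()])
print(sounds)
```

The function never checks types. Any object with a speak method "does the right thing" when the call reaches it.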


An Entertaining Case Study: Card-Shuffling-and-Dealing Simulation 


You’ve already used a random-numbers-based die-rolling simulation and used those 
techniques to implement the popular dice game craps. Here, we present a card-shuffling-and- 
dealing simulation, which you can use to implement your favorite card games. You'll use 
Matplotlib with attractive public-domain card images to display the full deck of cards both 
before and after the deck is shuffled. 


Data Classes 


Python 3.7’s new data classes help you build classes faster by using a more concise notation 
and by autogenerating portions of the classes. The Python community’s early reaction to data 
classes has been positive. As with any major new feature, it may take time before it’s widely 


used. We present class development with both the older and newer technologies. 
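
For a first taste of the more concise notation, here is a minimal data-class sketch using the Python Standard Library's dataclasses module (Python 3.7 and later). The class name Point and its fields are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Point:
    """Hypothetical data class; __init__, __repr__ and __eq__ are autogenerated."""
    x: int
    y: int = 0  # field with a default value


p = Point(3)
print(p)                 # uses the autogenerated __repr__
print(p == Point(3, 0))  # uses the autogenerated __eq__
```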


Other Concepts Introduced in This Chapter 


Other concepts we present include: 


• How to specify that certain identifiers should be used only inside a class and not be accessible to clients of the class.
• Special methods for creating string representations of your classes’ objects and specifying how objects of your classes work with Python’s built-in operators (a process called operator overloading).
• An introduction to the Python exception class hierarchy and creating custom exception classes.
• Testing code with the Python Standard Library's doctest module.
• How Python uses namespaces to determine the scopes of identifiers.


10.2 CUSTOM CLASS ACCOUNT 


Let’s begin with a bank Account class that holds an account holder’s name and balance. An 
actual bank account class would likely include lots of other information, such as address, 
birth date, telephone number, account number and more. The Account class accepts 


deposits that increase the balance and withdrawals that decrease the balance. 


10.2.1 Test-Driving Class Account 


Each new class you create becomes a new data type that can be used to create objects. This is 


one reason why Python is said to be an extensible language. Before we look at class 


Account’s definition, let’s demonstrate its capabilities. 


Importing Classes Account and Decimal 


To use the new Account class, launch your IPython session from the ch10 examples folder,
then import class Account:


In [1]: from account import Account





Class Account maintains and manipulates the account balance as a Decimal, so we also
import class Decimal:


In [2]: from decimal import Decimal


Create an Account Object with a Constructor Expression 





To create a Decimal object, we can write: 


value = Decimal('12.34')


This is known as a constructor expression because it builds and initializes an object of the 
class, similar to the way a house is constructed from a blueprint then painted with the buyer’s 
preferred colors. Constructor expressions create new objects and initialize their data using 

argument(s) specified in parentheses. The parentheses following the class name are required, 


even if there are no arguments. 


Let’s use a constructor expression to create an Account object and initialize it with an 


account holder’s name (a string) and balance (a Decimal): 


In [3]: account1 = Account('John Green', Decimal('50.00'))


Getting an Account’s Name and Balance 


Let’s access the Account object’s name and balance attributes: 


In [4]: account1.name
Out[4]: 'John Green'


In [5]: account1.balance
Out[5]: Decimal('50.00')


Depositing Money into an Account 


An Account’s deposit method receives a positive dollar amount and adds it to the balance:


In [6]: account1.deposit(Decimal('25.53'))


In [7]: account1.balance
Out[7]: Decimal('75.53')


Account Methods Perform Validation 


Class Account’s methods validate their arguments. For example, if a deposit amount is
negative, deposit raises a ValueError:


In [8]: account1.deposit(Decimal('-123.45'))
-------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-27dc468365a7> in <module>()
----> 1 account1.deposit(Decimal('-123.45'))

~/Documents/examples/ch10/account.py in deposit(self, amount)
     21         # if amount is less than 0.00, raise an exception
     22         if amount < Decimal('0.00'):
---> 23             raise ValueError('Deposit amount must be positive.')
     24
     25         self.balance += amount

ValueError: Deposit amount must be positive.





10.2.2 Account Class Definition 


Now, let’s look at Account’s class definition, which is located in the file account.py.


Defining a Class 


A class definition begins with the keyword class (line 5) followed by the class’s name and a
colon (:). This line is called the class header. The Style Guide for Python Code
recommends that you begin each word in a multi-word class name with an uppercase letter
(for example, CommissionEmployee). Every statement in a class’s suite is indented.


 1 # account.py
 2 """Account class definition."""
 3 from decimal import Decimal
 4
 5 class Account:
 6     """Account class for maintaining a bank account balance."""
 7


Each class typically provides a descriptive docstring (line 6). When provided, it must appear 
in the line or lines immediately following the class header. To view any class’s docstring in 


IPython, type the class name and a question mark, then press Enter: 


In [9]: Account?
Init signature: Account(name, balance)
Docstring:      Account class for maintaining a bank account balance.
Init docstring: Initialize an Account object.
File:           ~/Documents/examples/ch10/account.py
Type:           type











The identifier Account is both the class name and the name used in a constructor expression
to create an Account object and invoke the class’s __init__ method. For this reason,
IPython’s help mechanism shows both the class’s docstring ("Docstring:") and the
__init__ method’s docstring ("Init docstring:").


Initializing Account Objects: Method __init__


The constructor expression in snippet [3] from the preceding section:

account1 = Account('John Green', Decimal('50.00'))

creates a new object, then initializes its data by calling the class’s __init__ method. Each
new class you create can provide an __init__ method that specifies how to initialize an
object’s data attributes. Returning a value other than None from __init__ results in a
TypeError. Recall that None is returned by any function or method that does not contain a
return statement. Class Account’s __init__ method (lines 8-16) initializes an Account
object’s name and balance attributes if the balance is valid:


 8     def __init__(self, name, balance):
 9         """Initialize an Account object."""
10
11         # if balance is less than 0.00, raise an exception
12         if balance < Decimal('0.00'):
13             raise ValueError('Initial balance must be >= to 0.00.')
14
15         self.name = name
16         self.balance = balance
17


When you call a method for a specific object, Python implicitly passes a reference to that
object as the method’s first argument. For this reason, all methods of a class must specify at
least one parameter. By convention most Python programmers call a method’s first
parameter self. A class’s methods must use that reference (self) to access the object’s
attributes and other methods. Class Account’s __init__ method also specifies parameters
for the name and balance.


The if statement validates the balance parameter. If balance is less than 0.00,
__init__ raises a ValueError, which terminates the __init__ method. Otherwise, the
method creates and initializes the new Account object’s name and balance attributes.


When an object of class Account is created, it does not yet have any attributes. They’re 


added dynamically via assignments of the form: 


self.attribute_name = value





Python classes may define many special methods, like __init__, each identified by
leading and trailing double-underscores (__) in the method name. Python class object,
which we'll discuss later in this chapter, defines the special methods that are available for all
Python objects.
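
The following sketch illustrates that attributes are added dynamically by assignments to self. It uses a stripped-down stand-in for the Account class with no validation (an assumption made to keep the example short) and the built-in vars function, which returns an object's attribute dictionary.

```python
class Account:
    """Stripped-down stand-in for the chapter's Account class (no validation)."""

    def __init__(self, name, balance):
        print(vars(self))       # the new object has no attributes yet
        self.name = name        # adds the name attribute
        self.balance = balance  # adds the balance attribute


account = Account('John Green', 50)
attributes = vars(account)
print(attributes)

# Special methods such as __repr__ are available via class object
print(hasattr(object, '__repr__'))
```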


Method deposit 


The Account class’s deposit method adds a positive amount to the account’s balance
attribute. If the amount argument is less than 0.00, the method raises a ValueError,
indicating that only positive deposit amounts are allowed. If the amount is valid, line 25 adds
it to the object’s balance attribute.


18     def deposit(self, amount):
19         """Deposit money to the account."""
20
21         # if amount is less than 0.00, raise an exception
22         if amount < Decimal('0.00'):
23             raise ValueError('amount must be positive.')
24
25         self.balance += amount
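
For convenience, the fragments of account.py shown in this section can be assembled into one runnable sketch without the listing line numbers. The try statement at the end is an addition for illustration, not part of account.py; it shows one way client code might respond to the ValueError raised by deposit.

```python
from decimal import Decimal


class Account:
    """Account class for maintaining a bank account balance."""

    def __init__(self, name, balance):
        """Initialize an Account object."""
        # if balance is less than 0.00, raise an exception
        if balance < Decimal('0.00'):
            raise ValueError('Initial balance must be >= to 0.00.')

        self.name = name
        self.balance = balance

    def deposit(self, amount):
        """Deposit money to the account."""
        # if amount is less than 0.00, raise an exception
        if amount < Decimal('0.00'):
            raise ValueError('amount must be positive.')

        self.balance += amount


account1 = Account('John Green', Decimal('50.00'))
account1.deposit(Decimal('25.53'))

try:
    account1.deposit(Decimal('-123.45'))  # rejected by deposit's validation
except ValueError as error:
    message = str(error)

print(account1.balance, message)
```

Because the invalid deposit raises before line 25's addition would run, the balance is unchanged by the failed call.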


10.2.3 Composition: Object References as Members of Classes 


An Account has a name, and an Account has a balance. Recall that “everything in Python 


is an object.” This means that an object’s attributes are references to objects of other classes. 
For example, an Account object’s name attribute is a reference to a string object and an 
Account object’s balance attribute is a reference to a Decimal object. Embedding 
references to objects of other types is a form of software reusability known as composition 
and is sometimes referred to as the “has a” relationship. Later in this chapter, we'll 
discuss inheritance, which establishes “is a” relationships. 


10.3 CONTROLLING ACCESS TO ATTRIBUTES 


Class Account’s methods validate their arguments to ensure that the balance is always
valid, that is, always greater than or equal to 0.00. In the previous example, we used the
attributes name and balance only to get the values of those attributes. It turns out that we 
also can use those attributes to modify their values. Consider the Account object in the 


following IPython session: 


In [1]: from account import Account

In [2]: from decimal import Decimal

In [3]: account1 = Account('John Green', Decimal('50.00'))

In [4]: account1.balance
Out[4]: Decimal('50.00')


Initially, account1 contains a valid balance. Now, let’s set the balance attribute to an 


invalid negative value, then display the balance: 


In [5]: account1.balance = Decimal('-1000.00')


In [6]: account1.balance
Out[6]: Decimal('-1000.00')


Snippet [6]’s output shows that account1’s balance is now negative. As you can see, 


unlike methods, data attributes cannot validate the values you assign to them. 


Encapsulation 


A class’s client code is any code that uses objects of the class. Most object-oriented 
programming languages enable you to encapsulate (or hide) an object’s data from the client 
code. Such data in these languages is said to be private data. 


Leading Underscore (_) Naming Convention 


Python does not have private data. Instead, you use naming conventions to design classes 


that encourage correct use. By convention, Python programmers know that any attribute 
name beginning with an underscore (_) is for a class’s internal use only. Client code should 
use the class’s methods and—as you'll see in the next section—the class’s properties to 
interact with each object’s internal-use data attributes. Attributes whose identifiers do not 
begin with an underscore (_) are considered publicly accessible for use in client code. In the 
next section, we'll define a Time class and use these naming conventions. However, even 


when we use these conventions, attributes are always accessible. 
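
A brief sketch of the convention in practice follows. The class Thermostat, its _celsius attribute and its set_temperature method are hypothetical names used only for illustration. The point is that the underscore signals intent but does not block access.

```python
class Thermostat:
    """Hypothetical class using the leading-underscore convention."""

    def __init__(self, celsius):
        self._celsius = celsius  # internal-use attribute by convention

    def set_temperature(self, celsius):
        """Public method that validates before updating the internal attribute."""
        if celsius < -273.15:
            raise ValueError('Temperature is below absolute zero.')
        self._celsius = celsius


t = Thermostat(20.0)

try:
    t.set_temperature(-300.0)  # the public method rejects invalid data
except ValueError as error:
    print(error)

t._celsius = -500.0  # nothing stops client code from bypassing the method
print(t._celsius)
```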


10.4 PROPERTIES FOR DATA ACCESS 


Let’s develop a Time class that stores the time in 24-hour clock format with hours in the 
range 0—23, and minutes and seconds each in the range 0-59. For this class, we'll provide 
properties, which look like data attributes to client-code programmers, but control the 
manner in which they get and modify an object’s data. This assumes that other programmers 


follow Python conventions to correctly use objects of your class. 


10.4.1 Test-Driving Class Time 


Before we look at class Time’s definition, let’s demonstrate its capabilities. First, ensure that 


you're in the ch10 folder, then import class Time from timewithproperties.py: 







In [1]: from timewithproperties import Time 


Creating a Time Object 


Next, let’s create a Time object. Class Time’s __init__ method has hour, minute and
second parameters, each with a default argument value of 0. Here, we specify the hour and
minute; second defaults to 0:


In [2]: wake_up = Time(hour=6, minute=30)


Displaying a Time Object 


Class Time defines two methods that produce string representations of a Time object. When
you evaluate a variable in IPython as in snippet [3], IPython calls the object’s __repr__
special method to produce a string representation of the object. Our __repr__
implementation creates a string in the following format:


In [3]: wake_up
Out[3]: Time(hour=6, minute=30, second=0)


We'll also provide the __str__ special method, which is called when an object is converted
to a string, such as when you output the object with print.¹ Our __str__ implementation
creates a string in 12-hour clock format:


In [4]: print(wake_up)
6:30:00 AM


¹ If a class does not provide __str__ and an object of the class is converted to a string, the
class’s __repr__ method is called instead.
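
The fallback from __str__ to __repr__ can be checked with a short sketch. The classes OnlyRepr and Both are hypothetical and exist only to contrast the two cases.

```python
class OnlyRepr:
    """Defines __repr__ but not __str__."""

    def __repr__(self):
        return 'OnlyRepr()'


class Both:
    """Defines both special methods."""

    def __repr__(self):
        return 'Both()'

    def __str__(self):
        return 'a friendly Both'


print(str(OnlyRepr()))  # no __str__, so __repr__ supplies the string
print(str(Both()))      # __str__ is used for string conversion
print(repr(Both()))     # __repr__ is used by built-in function repr
```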


Getting an Attribute Via a Property 


Class Time provides hour, minute and second properties, which provide the convenience
of data attributes for getting and modifying an object’s data. However, as you'll see, 
properties are implemented as methods, so they may contain additional logic, such as 
specifying the format in which to return a data attribute’s value or validating a new value 


before using it to modify a data attribute. Here, we get the wake_up object’s hour value: 


In [5]: wake_up.hour 


Though this snippet appears to simply get an hour data attribute’s value, it’s actually a call to 
an hour method that returns the value of a data attribute (which we named _hour, as you'll 


see in the next section). 


Setting the Time 


You can set a new time with the Time object’s set_time method. Like method __init__,
method set_time provides hour, minute and second parameters, each with a default of 
0: 


In [6]: wake_up.set_time(hour=7, minute=45)

In [7]: wake_up
Out[7]: Time(hour=7, minute=45, second=0)


Setting an Attribute via a Property 


Class Time also supports setting the hour, minute and second values individually via its 


properties. Let’s change the hour value to 6: 


In [8]: wake_up.hour = 6

In [9]: wake_up
Out[9]: Time(hour=6, minute=45, second=0)


Though snippet [8] appears to simply assign a value to a data attribute, it’s actually a call to 
an hour method that takes 6 as an argument. The method validates the value, then assigns it 


to a corresponding data attribute (which we named _hour, as you'll see in the next section). 


Attempting to Set an Invalid Value 


To prove that class Time’s properties validate the values you assign to them, let’s try to
assign an invalid value to the hour property, which results in a ValueError:


In [10]: wake_up.hour = 100
-------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-1fce0716ef14> in <module>()
----> 1 wake_up.hour = 100

~/Documents/examples/ch10/timewithproperties.py in hour(self, hour)
     20         """Set the hour."""
     21         if not (0 <= hour < 24):
---> 22             raise ValueError(f'Hour ({hour}) must be 0-23')
     23
     24         self._hour = hour

ValueError: Hour (100) must be 0-23





10.4.2 Class Time Definition 


Now that we’ve seen class Time in action, let’s look at its definition. 


Class Time: __init__ Method with Default Parameter Values


Class Time’s __init__ method specifies hour, minute and second parameters, each with
a default argument of 0. As in class Account’s __init__ method, the self
parameter is a reference to the Time object being initialized. The statements containing
self.hour, self.minute and self.second appear to create hour, minute and
second attributes for the new Time object (self). However, these statements actually call
methods that implement the class’s hour, minute and second properties (lines 13-50).
Those methods then create attributes named _hour, _minute and _second that are meant
for use only inside the class:


 1 # timewithproperties.py
 2 """Class Time with read-write properties."""
 3
 4 class Time:
 5     """Class Time with read-write properties."""
 6
 7     def __init__(self, hour=0, minute=0, second=0):
 8         """Initialize each attribute."""
 9         self.hour = hour  # 0-23
10         self.minute = minute  # 0-59
11         self.second = second  # 0-59


Class Time: hour Read-Write Property 


Lines 13-24 define a publicly accessible read-write property named hour that
manipulates a data attribute named _hour. The single-leading-underscore (_) naming
convention indicates that client code should not access _hour directly. As you saw in the
previous section’s snippets [5] and [8], properties look like data attributes to programmers
working with Time objects. However, notice that properties are implemented as methods.
Each property defines a getter method which gets (that is, returns) a data attribute’s value
and can optionally define a setter method which sets a data attribute’s value:


13     @property
14     def hour(self):
15         """Return the hour."""
16         return self._hour
17
18     @hour.setter
19     def hour(self, hour):
20         """Set the hour."""
21         if not (0 <= hour < 24):
22             raise ValueError(f'Hour ({hour}) must be 0-23')
23
24         self._hour = hour
25


The @property decorator precedes the property’s getter method, which receives only a
self parameter. Behind the scenes, a decorator adds code to the decorated function, in this
case to make the hour function work with attribute syntax. The getter method’s name is the
property name. This getter method returns the _hour data attribute’s value. The following


client-code expression invokes the getter method: 


wake_up.hour 


You also can use the getter method inside the class, as you'll see shortly. 


A decorator of the form @property_name.setter (in this case, @hour.setter)
precedes the property’s setter method. The method receives two parameters: self and a
parameter (hour) representing the value being assigned to the property. If the hour
parameter’s value is valid, this method assigns it to the self object’s _hour attribute;
otherwise, the method raises a ValueError. The following client-code expression invokes
the setter by assigning a value to the property:


wake_up.hour = 8


We also invoked this setter inside the class at line 9 of __init__:


self.hour = hour


Using the setter enabled us to validate __init__’s hour argument before creating and
initializing the object’s _hour attribute, which occurs the first time the hour property’s
setter executes as a result of line 9. A read-write property has both a getter and a setter. A
read-only property has only a getter.
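
To illustrate the read-only case, here is a minimal sketch. The class Circle and its radius property are hypothetical names, not part of timewithproperties.py. Because no setter is defined, assigning to the property raises an AttributeError.

```python
class Circle:
    """Hypothetical class with a read-only property."""

    def __init__(self, radius):
        self._radius = radius  # internal-use attribute by convention

    @property
    def radius(self):
        """Return the radius (no setter is defined, so this is read-only)."""
        return self._radius


c = Circle(2.5)
print(c.radius)  # invokes the getter

try:
    c.radius = 10  # no setter exists for this property
except AttributeError:
    outcome = 'read-only'
else:
    outcome = 'assigned'

print(outcome)
```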


Class Time: minute and second Read-Write Properties 


Lines 26-37 and 39-50 define read-write minute and second properties. Each property’s
setter ensures that its second argument is in the range 0-59 (the valid range of values for
minutes and seconds):


26     @property
27     def minute(self):
28         """Return the minute."""
29         return self._minute
30
31     @minute.setter
32     def minute(self, minute):
33         """Set the minute."""
34         if not (0 <= minute < 60):
35             raise ValueError(f'Minute ({minute}) must be 0-59')
36
37         self._minute = minute
38
39     @property
40     def second(self):
41         """Return the second."""
42         return self._second
43
44     @second.setter
45     def second(self, second):
46         """Set the second."""
47         if not (0 <= second < 60):
48             raise ValueError(f'Second ({second}) must be 0-59')
49
50         self._second = second
51


Class Time: Method set_time


We provide method set_time as a convenient way to change all three attributes with a
single method call. Lines 54-56 invoke the setters for the hour, minute and second
properties:


52     def set_time(self, hour=0, minute=0, second=0):
53         """Set values of hour, minute, and second."""
54         self.hour = hour
55         self.minute = minute
56         self.second = second
57


Class Time: Special Method __repr__


When you pass an object to built-in function repr (which happens implicitly when you
evaluate a variable in an IPython session), the corresponding class’s __repr__ special
method is called to get a string representation of the object:


58     def __repr__(self):
59         """Return Time string for repr()."""
60         return (f'Time(hour={self.hour}, minute={self.minute}, ' +
61                 f'second={self.second})')
62


The Python documentation indicates that __repr__ returns the “official” string
representation of the object. Typically this string looks like a constructor expression that
creates and initializes the object,² as in:


'Time(hour=6, minute=30, second=0)'


which is similar to the constructor expression in the previous section’s snippet [2]. Python
has a built-in function eval that could receive the preceding string as an argument and use it
to create and initialize a Time object containing values specified in the string.


² https://docs.python.org/3/reference/datamodel.html.
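
The round trip through repr and eval can be sketched as follows. The Time class here is a simplified stand-in without properties or validation (an assumption to keep the example self-contained), and eval should only ever be given strings you trust.

```python
class Time:
    """Simplified stand-in for the chapter's Time class (no validation)."""

    def __init__(self, hour=0, minute=0, second=0):
        self.hour = hour
        self.minute = minute
        self.second = second

    def __repr__(self):
        """Return a string that looks like a constructor expression."""
        return (f'Time(hour={self.hour}, minute={self.minute}, '
                f'second={self.second})')


original = Time(hour=6, minute=30)
text = repr(original)
copy = eval(text)  # evaluates the constructor expression in the string

print(text)
print(copy.hour, copy.minute, copy.second)
print(copy is original)  # a new, separate object
```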


Class Time: Special Method __str__


For our class Time we also define the __str__ special method. This method is called
implicitly when you convert an object to a string with the built-in function str, such as when
you print an object or call str explicitly. Our implementation of __str__ creates a string
in 12-hour clock format, such as '7:59:59 AM' or '12:30:45 PM':


63     def __str__(self):
64         """Print Time in 12-hour clock format."""
65         return (('12' if self.hour in (0, 12) else str(self.hour % 12)) +
66                 f':{self.minute:0>2}:{self.second:0>2}' +
67                 (' AM' if self.hour < 12 else ' PM'))
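
The 12-hour conversion is easier to follow when isolated from the class. The standalone function to_12_hour below is a hypothetical helper written for this illustration; it applies the same logic to plain integers, including the 0>2 format specifier, which right-aligns a value in a field of width 2 filled with '0'.

```python
def to_12_hour(hour, minute, second):
    """Format a 24-hour time as a 12-hour clock string."""
    # hours 0 and 12 both display as 12; other hours wrap with % 12
    display_hour = '12' if hour in (0, 12) else str(hour % 12)
    suffix = ' AM' if hour < 12 else ' PM'
    # 0>2 pads minute and second to two digits with leading zeros
    return f'{display_hour}:{minute:0>2}:{second:0>2}{suffix}'


for h in (0, 7, 12, 18):
    print(to_12_hour(h, 5, 9))
```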

















10.4.3 Class Time Definition Design Notes 


Let’s consider some class-design issues in the context of our Time class. 


Interface of a Class 


Class Time’s properties and methods define the class’s public interface—that is, the set of 


properties and methods programmers should use to interact with objects of the class. 


Attributes Are Always Accessible 


Though we provided a well-defined interface, Python does not prevent you from directly
manipulating the data attributes _hour, _minute and _second, as in:


In [1]: from timewithproperties import Time

In [2]: wake_up = Time(hour=7, minute=45, second=30)

In [3]: wake_up._hour
Out[3]: 7

In [4]: wake_up._hour = 100

In [5]: wake_up
Out[5]: Time(hour=100, minute=45, second=30)








After snippet [4], the wake_up object contains invalid data. Unlike many other object- 
oriented programming languages, such as C++, Java and C#, data attributes in Python 
cannot be hidden from client code. The Python tutorial says, “nothing in Python makes it 


possible to enforce data hiding—it is all based upon convention.”³


³ https://docs.python.org/3/tutorial/classes.html#random-remarks.


Internal Data Representation 


We chose to represent the time as three integer values for hours, minutes and seconds. It 
would be perfectly reasonable to represent the time internally as the number of seconds since 
midnight. Though we’d have to reimplement the properties hour, minute and second, 
programmers could use the same interface and get the same results without being aware of 
these changes. We leave it to you to make this change and show that client code using Time 


objects does not need to change. 
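One possible way to approach this change is sketched below. The attribute name _total_seconds and the way each setter adjusts it are our assumptions, not the book's solution; the point is only that hour, minute and second keep working for client code.

```python
class Time:
    """Sketch: Time stored internally as seconds since midnight."""

    def __init__(self, hour=0, minute=0, second=0):
        self._total_seconds = 0   # single internal data attribute
        self.hour = hour          # each assignment goes through a setter
        self.minute = minute
        self.second = second

    @property
    def hour(self):
        return self._total_seconds // 3600

    @hour.setter
    def hour(self, hour):
        if not (0 <= hour < 24):
            raise ValueError(f'Hour ({hour}) must be 0-23')
        # Replace only the hour component of the stored total.
        self._total_seconds += (hour - self.hour) * 3600

    @property
    def minute(self):
        return self._total_seconds % 3600 // 60

    @minute.setter
    def minute(self, minute):
        if not (0 <= minute < 60):
            raise ValueError(f'Minute ({minute}) must be 0-59')
        self._total_seconds += (minute - self.minute) * 60

    @property
    def second(self):
        return self._total_seconds % 60

    @second.setter
    def second(self, second):
        if not (0 <= second < 60):
            raise ValueError(f'Second ({second}) must be 0-59')
        self._total_seconds += second - self.second

    def __repr__(self):
        return (f'Time(hour={self.hour}, minute={self.minute}, '
                f'second={self.second})')


wake_up = Time(hour=7, minute=45, second=30)
before = repr(wake_up)   # same text the three-integer version produces
wake_up.hour = 6
after = repr(wake_up)
```

Client code that uses only the properties and repr cannot tell which representation is in use.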


Evolving a Class’s Implementation Details 


When you design a class, carefully consider the class’s interface before making that class 
available to other programmers. Ideally, you'll design the interface such that existing code 
will not break if you update the class’s implementation details—that is, the internal data 
representation or how its method bodies are implemented. 


If Python programmers follow convention and do not access attributes that begin with 
leading underscores, then class designers can evolve class implementation details without 


breaking client code. 


Properties 


It may seem that providing properties with both setters and getters has no benefit over 
accessing the data attributes directly, but there are subtle differences. A getter seems to allow 
clients to read the data at will, but the getter can control the formatting of the data. A setter 
can scrutinize attempts to modify the value of a data attribute to prevent the data from being 


set to an invalid value. 


Utility Methods 


Not all methods need to serve as part of a class’s interface. Some serve as utility methods 
used only inside the class and are not intended to be part of the class’s public interface used 
by client code. Such methods should be named with a single leading underscore. In other 
object-oriented languages like C++, Java and C#, such methods typically are implemented as 


private methods. 


Module datetime 


In professional Python development, rather than building your own classes to represent 
times and dates, you'll typically use the Python Standard Library’s datetime module 


capabilities. For more details about the datetime module, see:


https://docs.python.org/3/library/datetime.html
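As a brief, hedged illustration of what the module offers, the following sketch uses datetime.time, datetime.datetime and datetime.timedelta from the standard library. The specific date and time values are arbitrary examples.

```python
from datetime import datetime, time, timedelta

wake_up = time(hour=7, minute=45, second=30)

# strftime formats the time; %I is the zero-padded 12-hour clock hour and
# %p is the locale's AM/PM marker.
twelve_hour = wake_up.strftime('%I:%M:%S %p')

try:
    time(hour=100)            # out-of-range values are rejected
    rejected = False
except ValueError:
    rejected = True

# Arithmetic is done on datetime objects with timedelta.
start = datetime(2019, 1, 1, 7, 45, 30)
later = start + timedelta(minutes=30)
```

Unlike our Time class, time objects are immutable, so an invalid value can never be assigned after creation.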


10.5 SIMULATING “PRIVATE” ATTRIBUTES 


In programming languages such as C++, Java and C#, classes state explicitly which class 
members are publicly accessible. Class members that may not be accessed outside a class 
definition are private and visible only within the class that defines them. Python 
programmers often use “private” attributes for data or utility methods that are essential to a 


class’s inner workings but are not part of the class’s public interface. 


As you’ve seen, Python objects’ attributes are always accessible. However, Python has a 
naming convention for “private” attributes. Suppose we want to create an object of class 


Time and to prevent the following assignment statement: 


wake_up._hour = 100


that would set the hour to an invalid value. Rather than _hour, we can name the attribute
__hour with two leading underscores. This convention indicates that __hour is “private”
and should not be accessible to the class’s clients. To help prevent clients from accessing
“private” attributes, Python renames them by preceding the attribute name with
_ClassName, as in _Time__hour. This is called name mangling. If you try to access
__hour from client code, as in


wake_up.__hour


Python raises an AttributeError, indicating that the object does not have an __hour
attribute. We’ll show this momentarily. Note that an assignment such as
wake_up.__hour = 100 does not raise an error: outside the class, Python simply creates a
new attribute named __hour and leaves the mangled _Time__hour unchanged.


IPython Auto-Completion Shows Only “Public” Attributes 


In addition, IPython does not show attributes with one or two leading underscores when you
try to auto-complete an expression like


wake_up.


by pressing Tab. Only attributes that are part of the wake_up object’s “public” interface are
displayed in the IPython auto-completion list.


Demonstrating “Private” Attributes 


To demonstrate name mangling, consider class PrivateClass with one “public” data 


attribute public_data and one “private” data attribute __private_data:


1 # private.py
2 """Class with public and private attributes."""
3
4 class PrivateClass:
5     """Class with public and private attributes."""
6
7     def __init__(self):
8         """Initialize the public and private attributes."""
9         self.public_data = "public"  # public attribute
10         self.__private_data = "private"  # private attribute


Let’s create an object of class PrivateClass to demonstrate these data attributes:


In [1]: from private import PrivateClass

In [2]: my_object = PrivateClass()


Snippet [3] shows that we can access the public_data attribute directly:


In [3]: my_object.public_data
Out[3]: 'public'


However, when we attempt to access __private_data directly in snippet [4], we get an
AttributeError indicating that the class does not have an attribute by that name:


In [4]: my_object.__private_data
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-d896bfdf2053> in <module>()
----> 1 my_object.__private_data

AttributeError: 'PrivateClass' object has no attribute '__private_data'


This occurs because Python changed the attribute’s name. Unfortunately, the attribute
__private_data is still indirectly accessible.
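The indirect route is the mangled name itself. The sketch below repeats a condensed PrivateClass so it runs on its own; the variable names my_object, mangled_value and attribute_names are ours.

```python
class PrivateClass:
    """Class with public and private attributes."""

    def __init__(self):
        self.public_data = 'public'
        # Inside the class body, Python rewrites this name to
        # _PrivateClass__private_data.
        self.__private_data = 'private'


my_object = PrivateClass()

try:
    my_object.__private_data          # no attribute by this exact name
    direct_access_failed = False
except AttributeError:
    direct_access_failed = True

# The mangled name is an ordinary attribute that client code can reach.
mangled_value = my_object._PrivateClass__private_data

# vars shows the names actually stored in the object.
attribute_names = sorted(vars(my_object))
```

Name mangling therefore guards against accidental access and name clashes in subclasses; it does not provide real data hiding.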


10.6 CASE STUDY: CARD SHUFFLING AND DEALING 
SIMULATION 


Our next example presents two custom classes that you can use to shuffle and deal a deck of 
cards. Class Card represents a playing card that has a face ('Ace', '2', '3', ..., 'Jack',
'Queen', 'King') and a suit ('Hearts', 'Diamonds', 'Clubs', 'Spades'). Class
DeckOfCards represents a deck of 52 playing cards as a list of Card objects. First, we'll test- 
drive these classes in an IPython session to demonstrate card shuffling and dealing 
capabilities and displaying the cards as text. Then we'll look at the class definitions. Finally, 
we'll use another IPython session to display the 52 cards as images using Matplotlib. We'll 


show you where to get nice-looking public-domain card images. 


10.6.1 Test-Driving Classes Card and DeckOfCards 


Before we look at classes Card and DeckOfCards, let’s use an IPython session to 


demonstrate their capabilities. 


Creating, Shuffling and Dealing the Cards 


First, import class DeckOfCards from deck.py and create an object of the class:


In [1]: from deck import DeckOfCards

In [2]: deck_of_cards = DeckOfCards()


DeckOfCards method __init__ creates the 52 Card objects in order by suit and by face
within each suit. You can see this by printing the deck_of_cards object, which calls the
DeckOfCards class’s __str__ method to get the deck’s string representation. Read each
row left-to-right to confirm that all the cards are displayed in order from each suit (Hearts,
Diamonds, Clubs and Spades):


In [3]: print(deck_of_cards)
Ace of Hearts      2 of Hearts        3 of Hearts        4 of Hearts
5 of Hearts        6 of Hearts        7 of Hearts        8 of Hearts
9 of Hearts        10 of Hearts       Jack of Hearts     Queen of Hearts
King of Hearts     Ace of Diamonds    2 of Diamonds      3 of Diamonds
4 of Diamonds      5 of Diamonds      6 of Diamonds      7 of Diamonds
8 of Diamonds      9 of Diamonds      10 of Diamonds     Jack of Diamonds
Queen of Diamonds  King of Diamonds   Ace of Clubs       2 of Clubs
3 of Clubs         4 of Clubs         5 of Clubs         6 of Clubs
7 of Clubs         8 of Clubs         9 of Clubs         10 of Clubs
Jack of Clubs      Queen of Clubs     King of Clubs      Ace of Spades
2 of Spades        3 of Spades        4 of Spades        5 of Spades
6 of Spades        7 of Spades        8 of Spades        9 of Spades
10 of Spades       Jack of Spades     Queen of Spades    King of Spades



Next, let’s shuffle the deck and print the deck_of_cards object again. We did not specify a
seed for reproducibility, so each time you shuffle, you'll get different results:


In [4]: deck_of_cards.shuffle()

In [5]: print(deck_of_cards)
King of Hearts     Queen of Clubs     Queen of Diamonds  10 of Clubs
5 of Hearts        7 of Hearts        4 of Hearts        2 of Hearts
5 of Clubs         8 of Diamonds      3 of Hearts        10 of Hearts
8 of Spades        5 of Spades        Queen of Spades    Ace of Clubs
8 of Clubs         7 of Spades        Jack of Diamonds   10 of Spades
4 of Diamonds      8 of Hearts        6 of Spades        King of Spades
9 of Hearts        4 of Spades        6 of Clubs         King of Clubs
3 of Spades        9 of Diamonds      3 of Clubs         Ace of Spades
Ace of Hearts      3 of Diamonds      2 of Diamonds      6 of Hearts
King of Diamonds   Jack of Spades     Jack of Clubs      2 of Spades
5 of Diamonds      4 of Clubs         Queen of Hearts    9 of Clubs
10 of Diamonds     2 of Clubs         Ace of Diamonds    7 of Diamonds
9 of Spades        Jack of Hearts     6 of Diamonds      7 of Clubs

Dealing Cards 


We can deal one Card at a time by calling method deal_card. IPython calls the returned
Card object’s __repr__ method to produce the string output shown in the Out[] prompt:


In [6]: deck_of_cards.deal_card()
Out[6]: Card(face='King', suit='Hearts')


Class Card’s Other Features 


To demonstrate class Card’s __str__ method, let’s deal another card and pass it to the
built-in str function:


In [7]: card = deck_of_cards.deal_card()

In [8]: str(card)
Out[8]: 'Queen of Clubs'


Each Card has a corresponding image file name, which you can get via the image_name 


read-only property. We'll use this soon when we display the Cards as images: 


In [9]: card.image_name
Out[9]: 'Queen_of_Clubs.png'


10.6.2 Class Card—Introducing Class Attributes 


Each Card object contains three string properties representing that Card’s face, suit and 
image_name (a file name containing a corresponding image). As you saw in the preceding 
section’s IPython session, class Card also provides methods for initializing a Card and for 


getting various string representations. 


Class Attributes FACES and SUITS 


Each object of a class has its own copies of the class’s data attributes. For example, each 
Account object has its own name and balance. Sometimes, an attribute should be shared 
by all objects of a class. A class attribute (also called a class variable) represents class- 
wide information. It belongs to the class, not to a specific object of that class. Class Card 


defines two class attributes (lines 5-7): 


• FACES is a list of the card face names.

• SUITS is a list of the card suit names.





1 # card.py
2 """Card class that represents a playing card and its image file name."""
3
4 class Card:
5     FACES = ['Ace', '2', '3', '4', '5', '6',
6              '7', '8', '9', '10', 'Jack', 'Queen', 'King']
7     SUITS = ['Hearts', 'Diamonds', 'Clubs', 'Spades']
8











You define a class attribute by assigning a value to it inside the class’s definition, but not 


inside any of the class’s methods or properties (in which case, they'd be local variables). 





FACES and SUITS are constants that are not meant to be modified. Recall that the Style 





Guide for Python Code recommends naming your constants with all capital letters. 4 








4 Recall that Python does not have true constants, so FACES and SUITS are still modifiable. 


We'll use elements of these lists to initialize each Card we create. However, we do not need a 
separate copy of each list in every Card object. Class attributes can be accessed through any 


object of the class, but are typically accessed through the class’s name (as in, Card.FACES or








Card.SUITS). Class attributes exist as soon as you import their class’s definition. 
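The sharing described above can be observed directly. This sketch uses a condensed Card with only the class attributes and __init__; the variable names ace, king, shared and in_instance_dict are illustrative.

```python
class Card:
    # Class attributes: one list each, stored on the class itself.
    FACES = ['Ace', '2', '3', '4', '5', '6',
             '7', '8', '9', '10', 'Jack', 'Queen', 'King']
    SUITS = ['Hearts', 'Diamonds', 'Clubs', 'Spades']

    def __init__(self, face, suit):
        # Data attributes: each Card object gets its own pair.
        self._face = face
        self._suit = suit


ace = Card(Card.FACES[0], Card.SUITS[3])
king = Card(Card.FACES[12], Card.SUITS[0])

# Access through an object or through the class yields the same list object.
shared = ace.FACES is king.FACES is Card.FACES

# The list is not copied into the individual objects.
in_instance_dict = 'FACES' in vars(ace)
```

Because the lists are shared and mutable, changing Card.FACES would affect every Card, which is why they are treated as constants.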


Card Method __init__


When you create a Card object, method __init__ defines the object’s _face and _suit
data attributes:


9     def __init__(self, face, suit):
10         """Initialize a Card with a face and suit."""
11         self._face = face
12         self._suit = suit
13


Read-Only Properties face, suit and image_name


Once a Card is created, its face, suit and image_name do not change, so we implement
these as read-only properties (lines 14-17, 19-22 and 24-27). Properties face and suit
return the corresponding data attributes _face and _suit. A property is not required to
have a corresponding data attribute. To demonstrate this, the Card property image_name’s
value is created dynamically by getting the Card object’s string representation with
str(self), replacing any spaces with underscores and appending the '.png' filename
extension. So, 'Ace of Spades' becomes 'Ace_of_Spades.png'. We'll use this file
name to load a PNG-format image representing the Card. PNG (Portable Network Graphics)
is a popular image format for web-based images.


14     @property
15     def face(self):
16         """Return the Card's self._face value."""
17         return self._face
18
19     @property
20     def suit(self):
21         """Return the Card's self._suit value."""
22         return self._suit
23
24     @property
25     def image_name(self):
26         """Return the Card's image file name."""
27         return str(self).replace(' ', '_') + '.png'
28
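The following sketch shows both behaviors: a property computed on the fly and a property that rejects assignment because it has no setter. The class is condensed from the listing above (the suit property is omitted for brevity).

```python
class Card:
    def __init__(self, face, suit):
        self._face = face
        self._suit = suit

    @property
    def face(self):
        """Read-only: no setter is defined."""
        return self._face

    @property
    def image_name(self):
        """Computed from str(self); no image_name data attribute exists."""
        return str(self).replace(' ', '_') + '.png'

    def __str__(self):
        return f'{self._face} of {self._suit}'


card = Card('Ace', 'Spades')
file_name = card.image_name

try:
    card.face = 'King'        # a property without a setter cannot be assigned
    assignment_blocked = False
except AttributeError:
    assignment_blocked = True
```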


Methods That Return String Representations of a Card 


Class Card provides three special methods that return string representations. As in class 


Time, method __repr__ returns a string representation that looks like a constructor


expression for creating and initializing a Card object: 


29     def __repr__(self):
30         """Return string representation for repr()."""
31         return f"Card(face='{self.face}', suit='{self.suit}')"
32


Method __str__ returns a string of the format 'face of suit', such as 'Ace of Hearts':


33     def __str__(self):
34         """Return string representation for str()."""
35         return f'{self.face} of {self.suit}'
36


When the preceding section’s IPython session printed the entire deck, you saw that the Cards
were displayed in four left-aligned columns. As you'll see in the __str__ method of class
DeckOfCards, we use f-strings to format the Cards in fields of 19 characters each. Class
Card’s special method __format__ is called when a Card object is formatted as a string,
such as in an f-string:


37     def __format__(self, format):
38         """Return formatted string representation for str()."""
39         return f'{str(self):{format}}'


This method’s second argument is the format string used to format the object. To use the
format parameter’s value as the format specifier, enclose the parameter name in braces to
the right of the colon. In this case, we’re formatting the Card object’s string representation
returned by str(self). We'll discuss __format__ again when we present the __str__
method in class DeckOfCards.


10.6.3 Class DeckOfCards 











Class DeckOfCards has a class attribute NUMBER_OF_CARDS, representing the number of
Cards in a deck, and creates two data attributes:


• _current_card keeps track of which Card will be dealt next (0-51) and

• _deck (line 12) is a list of 52 Card objects.


Method __init__ 


DeckOfCards method __init__ initializes a _deck of Cards. The for statement fills the
list _deck by appending new Card objects, each initialized with two strings: one from the
list Card.FACES and one from Card.SUITS. The calculation count % 13 always results
in a value from 0 to 12 (the 13 indices of Card.FACES), and the calculation count // 13
always results in a value from 0 to 3 (the four indices of Card.SUITS). When the _deck list
is initialized, it contains the Cards with faces 'Ace' through 'King' in order for all the
Hearts, then the Diamonds, then the Clubs, then the Spades.


1 # deck.py
2 """Deck class represents a deck of Cards."""
3 import random
4 from card import Card
5
6 class DeckOfCards:
7     NUMBER_OF_CARDS = 52  # constant number of Cards
8
9     def __init__(self):
10         """Initialize the deck."""
11         self._current_card = 0
12         self._deck = []
13
14         for count in range(DeckOfCards.NUMBER_OF_CARDS):
15             self._deck.append(Card(Card.FACES[count % 13],
16                 Card.SUITS[count // 13]))
17
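The index arithmetic in lines 15-16 can be checked without the classes. This sketch uses plain tuples in place of Card objects (an assumption made to keep it self-contained).

```python
FACES = ['Ace', '2', '3', '4', '5', '6',
         '7', '8', '9', '10', 'Jack', 'Queen', 'King']
SUITS = ['Hearts', 'Diamonds', 'Clubs', 'Spades']

# count % 13 cycles through the 13 faces; count // 13 advances to the
# next suit after every 13 cards.
deck = [(FACES[count % 13], SUITS[count // 13]) for count in range(52)]

first = deck[0]            # count 0:  0 % 13 == 0,   0 // 13 == 0
last_heart = deck[12]      # count 12: 12 % 13 == 12, 12 // 13 == 0
first_diamond = deck[13]   # count 13: 13 % 13 == 0,  13 // 13 == 1
last = deck[51]            # count 51: 51 % 13 == 12, 51 // 13 == 3
```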


Method shuffle 


Method shuffle resets _current_card to 0, then shuffles the Cards in _deck using the


random module’s shuffle function: 


18     def shuffle(self):
19         """Shuffle deck."""
20         self._current_card = 0
21         random.shuffle(self._deck)
22


Method deal_card


Method deal_card deals one Card from _deck. Recall that _current_card indicates the
index (0-51) of the next Card to be dealt (that is, the Card at the top of the deck). Line 26
tries to get the _deck element at index _current_card. If successful, the method
increments _current_card by 1, then returns the Card being dealt; otherwise, the method
returns None to indicate there are no more Cards to deal.


23     def deal_card(self):
24         """Return one Card."""
25         try:
26             card = self._deck[self._current_card]
27             self._current_card += 1
28             return card
29         except:
30             return None
31
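The listing uses a bare except clause, which catches every exception. A slightly narrower variant, shown here only as a sketch, catches just the IndexError raised when the index runs past the end of the list. The class name Deck and the two-string deck are ours, chosen to keep the example short.

```python
class Deck:
    """Tiny stand-in for DeckOfCards that holds any list of items."""

    def __init__(self, cards):
        self._deck = list(cards)
        self._current_card = 0

    def deal_card(self):
        """Return the next card, or None when the deck is exhausted."""
        try:
            card = self._deck[self._current_card]
        except IndexError:      # only the out-of-range case is handled
            return None
        self._current_card += 1
        return card


deck = Deck(['Ace of Hearts', '2 of Hearts'])
dealt = [deck.deal_card() for _ in range(3)]   # third call finds no card
```

Catching only IndexError means an unrelated programming error inside the try suite would still surface rather than silently returning None.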
Method __str__


Class DeckOfCards also defines special method __str__ to get a string representation of
the deck in four columns with each Card left aligned in a field of 19 characters. When line 37
formats a given Card, its __format__ special method is called with format specifier '<19'
as the method’s format argument. Method __format__ then uses '<19' to create the
Card’s formatted string representation.


32     def __str__(self):
33         """Return a string representation of the current _deck."""
34         s = ''
35
36         for index, card in enumerate(self._deck):
37             s += f'{self._deck[index]:<19}'
38             if (index + 1) % 4 == 0:
39                 s += '\n'
40
41         return s
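The row-breaking logic can be tried on a handful of strings. This sketch formats six card names (plain strings, not Card objects) with the same field width and the same every-fourth-item newline.

```python
names = [f'{face} of Hearts' for face in ('Ace', '2', '3', '4', '5', '6')]

s = ''
for index, name in enumerate(names):
    s += f'{name:<19}'            # left-align in a 19-character field
    if (index + 1) % 4 == 0:      # after the 4th, 8th, ... item
        s += '\n'

rows = s.split('\n')
```

With six items, the first row holds four fields (76 characters) and the second row holds the remaining two (38 characters).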


10.6.4 Displaying Card Images with Matplotlib 


So far, we’ve displayed Cards as text. Now, let’s display Card images. For this 


demonstration, we downloaded public-domain⁵ card images from Wikimedia Commons:


5 https://creativecommons.org/publicdomain/zero/1.0/deed.en.


https://commons.wikimedia.org/wiki/Category:SVG_English_pattern_playing_cards


These are located in the ch10 examples folder’s card_images subfolder. First, let’s create a


DeckOfCards: 


In [1]: from deck import DeckOfCards

In [2]: deck_of_cards = DeckOfCards()


Enable Matplotlib in IPython


Next, enable Matplotlib support in IPython by using the %matplotlib magic:


In [3]: %matplotlib
Using matplotlib backend: Qt5Agg


Create the Base Path for Each Image 


Before displaying each image, we must load it from the card_images folder. We'll use the 
pathlib module’s Path class to construct the full path to each image on our system. 
Snippet [5] creates a Path object for the current folder (the ch10 examples folder), which is 
represented by '.', then uses Path method joinpath to append the subfolder containing 


the card images:


In [4]: from pathlib import Path

In [5]: path = Path('.').joinpath('card_images')
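The Path operations used in this section can be tried without any image files on disk. In this sketch the file name 'Ace_of_Spades.png' is just an example value of a Card's image_name.

```python
from pathlib import Path

# Relative path to the subfolder, built from the current folder '.'.
path = Path('.').joinpath('card_images')

# Append a file name, as done later for each Card's image_name.
image_path = path.joinpath('Ace_of_Spades.png')

# The / operator is an equivalent way to join path components.
same_path = Path('.') / 'card_images' / 'Ace_of_Spades.png'

# resolve makes the path absolute; the file does not have to exist.
absolute = image_path.resolve()
location = str(absolute)      # plain string form of the full path
```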


Import the Matplotlib Features 


Next, let’s import the Matplotlib modules we'll need to display the images. We’ll use a 


function from matplotlib. image to load the images: 


In [6]: import matplotlib.pyplot as plt


In [7]: import matplotlib.image as mpimg 


Create the Figure and Axes Objects 


The following snippet uses Matplotlib function subplots to create a Figure object in which 
we'll display the images as 52 subplots with four rows (nrows) and 13 columns (ncols). The 
function returns a tuple containing the Figure and an array of the subplots’ Axes objects. 


We unpack these into variables figure and axes_list:


In [8]: figure, axes_list = plt.subplots(nrows=4, ncols=13)


When you execute this statement in IPython, the Matplotlib window appears immediately 
with 52 empty subplots. 


Configure the Axes Objects and Display the Images 


Next, we iterate through all the Axes objects in axes_list. Recall that ravel provides a
one-dimensional view of a multidimensional array. For each Axes object, we perform the
following tasks:


• We’re not plotting data, so we do not need axis lines and labels for each image. The first
two statements in the loop hide the x- and y-axes.

• The third statement deals a Card and gets its image_name.

• The fourth statement uses Path method joinpath to append the image_name to the
Path, then calls Path method resolve to determine the full path to the image on our
system. We pass the resulting Path object to the built-in str function to get the string
representation of the image’s location. Then, we pass that string to the
matplotlib.image module’s imread function, which loads the image.

• The last statement calls Axes method imshow to display the current image in the current
subplot.


In [9]: for axes in axes_list.ravel():
   ...:     axes.get_xaxis().set_visible(False)
   ...:     axes.get_yaxis().set_visible(False)
   ...:     image_name = deck_of_cards.deal_card().image_name
   ...:     img = mpimg.imread(str(path.joinpath(image_name).resolve()))
   ...:     axes.imshow(img)
   ...:


Maximize the Image Sizes 


At this point, all the images are displayed. To make the cards as large as possible, you can 
maximize the window, then call the Matplotlib Figure object’s tight_layout method. 
This removes most of the extra white space in the window: 


In [10]: figure.tight_layout()


The following image shows the contents of the resulting window:


Shuffle and Re-Deal the Deck


To see the images shuffled, call method shuffle, then re-execute snippet [9]’s code:


In [11]: deck_of_cards.shuffle()

In [12]: for axes in axes_list.ravel():
    ...:     axes.get_xaxis().set_visible(False)
    ...:     axes.get_yaxis().set_visible(False)
    ...:     image_name = deck_of_cards.deal_card().image_name
    ...:     img = mpimg.imread(str(path.joinpath(image_name).resolve()))
    ...:     axes.imshow(img)
    ...:


10.7 INHERITANCE: BASE CLASSES AND SUBCLASSES


Often, an object of one class is an object of another class as well. For example, a CarLoan is a
Loan as are HomeImprovementLoans and MortgageLoans. Class CarLoan can be said to
inherit from class Loan. In this context, class Loan is a base class, and class CarLoan is a
subclass. A CarLoan is a specific type of Loan, but it’s incorrect to claim that every Loan is a
CarLoan, because the Loan could be of any type. The following table lists simple examples of
base classes and subclasses; base classes tend to be “more general” and subclasses “more
specific”:


Base class     Subclasses

Student        GraduateStudent, UndergraduateStudent
Shape          Circle, Triangle, Rectangle, Sphere, Cube
Loan           CarLoan, HomeImprovementLoan, MortgageLoan
Employee       Faculty, Staff
BankAccount    CheckingAccount, SavingsAccount


Because every subclass object is an object of its base class, and one base class can have many
subclasses, the set of objects represented by a base class is often larger than the set of objects 
represented by any of its subclasses. For example, the base class Vehicle represents all 
vehicles, including cars, trucks, boats, bicycles and so on. By contrast, subclass Car 


represents a smaller, more specific subset of vehicles. 


CommunityMember Inheritance Hierarchy 


Inheritance relationships form tree-like hierarchical structures. A base class exists in a 
hierarchical relationship with its subclasses. Let’s develop a sample class hierarchy (shown in 
the following diagram), also called an inheritance hierarchy. A university community has 
thousands of members, including employees, students and alumni. Employees are either 
faculty or staff members. Faculty members are either administrators (e.g., deans and 
department chairpersons) or teachers. The hierarchy could contain many other classes. For 
example, students can be graduate or undergraduate students. Undergraduate students can 
be freshmen, sophomores, juniors or seniors. With single inheritance, a class is derived 
from one base class. With multiple inheritance, a subclass inherits from two or more base 
classes. Single inheritance is straightforward. Multiple inheritance is beyond the scope of this 
book. Before you use it, search online for the “diamond problem in Python multiple
inheritance.”


                    CommunityMember
                   /       |       \
            Employee    Student    Alum
            /      \
       Faculty    Staff
       /     \
Administrator  Teacher


Each arrow in the hierarchy represents an is-a relationship. As we follow the arrows upward
in this class hierarchy, we can state, for example, that “an Employee is a CommunityMember”
and “a Teacher is a Faculty member.” CommunityMember is the direct base
class of Employee, Student and Alum and is an indirect base class of all the other classes in
the diagram. Starting from the bottom, you can follow the arrows and apply the is-a
relationship up to the topmost superclass. For example, Administrator is a Faculty
member, is an Employee, is a CommunityMember and, of course, ultimately is an object.
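The hierarchy can be expressed in a few lines of Python to confirm the is-a relationships. The classes below are empty placeholders (our assumption; a real design would give them attributes and methods).

```python
class CommunityMember:
    pass

class Employee(CommunityMember):
    pass

class Student(CommunityMember):
    pass

class Alum(CommunityMember):
    pass

class Faculty(Employee):
    pass

class Staff(Employee):
    pass

class Administrator(Faculty):
    pass

class Teacher(Faculty):
    pass


dean = Administrator()

# __mro__ lists the class, then its direct and indirect base classes,
# ending with object.
is_a_chain = [cls.__name__ for cls in type(dean).__mro__]

is_employee = isinstance(dean, Employee)      # indirect base class
is_student = isinstance(dean, Student)        # unrelated branch
teacher_is_member = issubclass(Teacher, CommunityMember)
```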


Shape Inheritance Hierarchy 


Now consider the Shape inheritance hierarchy in the following class diagram, which begins
with base class Shape, followed by subclasses TwoDimensionalShape and
ThreeDimensionalShape. Each Shape is either a TwoDimensionalShape or a
ThreeDimensionalShape. The third level of this hierarchy contains specific types of
TwoDimensionalShapes and ThreeDimensionalShapes. Again, we can follow the
arrows from the bottom of the diagram to the topmost base class in this class hierarchy to
identify several is-a relationships. For example, a Triangle is a TwoDimensionalShape
and is a Shape, while a Sphere is a ThreeDimensionalShape and is a Shape. This
hierarchy could contain many other classes. For example, ellipses and trapezoids also are
TwoDimensionalShapes, and cones and cylinders also are ThreeDimensionalShapes.


                    Shape
                   /     \
  TwoDimensionalShape   ThreeDimensionalShape


“is a” vs. “has a”


Inheritance produces “is-a” relationships in which an object of a subclass type may also be 
treated as an object of the base-class type. You’ve also seen “has-a” (composition) 
relationships in which a class has references to one or more objects of other classes as 


members. 
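The two kinds of relationship look different in code. In this sketch, CarLoan is-a Loan through inheritance, while the hypothetical class Hand has-a list of Card objects through composition; all four classes are bare placeholders of our own.

```python
class Loan:
    pass

class CarLoan(Loan):          # is-a: a CarLoan may be treated as a Loan
    pass

class Card:
    def __init__(self, face, suit):
        self.face = face
        self.suit = suit

class Hand:                   # has-a: a Hand holds references to Cards
    def __init__(self, cards):
        self.cards = list(cards)


loan = CarLoan()
hand = Hand([Card('Ace', 'Spades'), Card('King', 'Hearts')])

car_loan_is_loan = isinstance(loan, Loan)
hand_is_card = isinstance(hand, Card)          # composition is not is-a
hand_has_cards = all(isinstance(c, Card) for c in hand.cards)
```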


10.8 BUILDING AN INHERITANCE HIERARCHY; 
INTRODUCING POLYMORPHISM 


Let’s use a hierarchy containing types of employees in a company’s payroll app to discuss the 
relationship between a base class and its subclass. All employees of the company have a lot in 
common, but commission employees (who will be represented as objects of a base class) are 
paid a percentage of their sales, while salaried commission employees (who will be 
represented as objects of a subclass) receive a percentage of their sales plus a base salary. 





First, we present base class CommissionEmployee. Next, we create a subclass 








SalariedCommissionEmployee that inherits from class CommissionEmployee. Then, 





we use an IPython session to create a SalariedCommissionEmployee object and
demonstrate that it has all the capabilities of the base class and the subclass, but calculates its 


earnings differently. 


10.8.1 Base Class CommissionEmployee 


Consider class CommissionEmployee, which provides the following features: 





• Method __init__ (lines 8-15), which creates the data attributes _first_name,
_last_name and _ssn (Social Security number), and uses the setters of properties
gross_sales and commission_rate to create their corresponding data attributes.

• Read-only properties first_name (lines 17-19), last_name (lines 21-23) and ssn (lines
25-27), which return the corresponding data attributes.

• Read-write properties gross_sales (lines 29-39) and commission_rate (lines 41-52),
in which the setters perform data validation.

• Method earnings (lines 54-56), which calculates and returns a
CommissionEmployee’s earnings.

• Method __repr__ (lines 58-64), which returns a string representation of a
CommissionEmployee.


1 # commissionemployee.py
2 """CommissionEmployee base class."""
3 from decimal import Decimal
4
5 class CommissionEmployee:
6     """An employee who gets paid commission based on gross sales."""
7
8     def __init__(self, first_name, last_name, ssn,
9                  gross_sales, commission_rate):
10         """Initialize CommissionEmployee's attributes."""
11         self._first_name = first_name
12         self._last_name = last_name
13         self._ssn = ssn
14         self.gross_sales = gross_sales  # validate via property
15         self.commission_rate = commission_rate  # validate via property
16
17     @property
18     def first_name(self):
19         return self._first_name
20
21     @property
22     def last_name(self):
23         return self._last_name
24
25     @property
26     def ssn(self):
27         return self._ssn
28
29     @property
30     def gross_sales(self):
31         return self._gross_sales
32
33     @gross_sales.setter
34     def gross_sales(self, sales):
35         """Set gross sales or raise ValueError if invalid."""
36         if sales < Decimal('0.00'):
37             raise ValueError('Gross sales must be >= to 0')
38
39         self._gross_sales = sales
40
41     @property
42     def commission_rate(self):
43         return self._commission_rate
44
45     @commission_rate.setter
46     def commission_rate(self, rate):
47         """Set commission rate or raise ValueError if invalid."""
48         if not (Decimal('0.0') < rate < Decimal('1.0')):
49             raise ValueError(
50                'Interest rate must be greater than 0 and less than 1')
51
52         self._commission_rate = rate
53
54     def earnings(self):
55         """Calculate earnings."""
56         return self.gross_sales * self.commission_rate
57
58     def __repr__(self):
59         """Return string representation for repr()."""
60         return ('CommissionEmployee: ' +
61             f'{self.first_name} {self.last_name}\n' +
62             f'social security number: {self.ssn}\n' +
63             f'gross sales: {self.gross_sales:.2f}\n' +
64             f'commission rate: {self.commission_rate:.2f}')





Properties first_name, last_name and ssn are read-only. We chose not to validate them, 
though we could have. For example, we could validate the first and last names—perhaps by 
ensuring that they’re of a reasonable length. We could validate the Social Security number to 
ensure that it contains nine digits, with or without dashes (for example, to ensure that it’s in 
the format ###-##-#### or #########, where each # is a digit).
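One way such a check could look is sketched below with the standard library's re module. The helper name is_valid_ssn and the pattern are our assumptions; the pattern tests only the two layouts mentioned above, not whether the number is actually assigned.

```python
import re

# Either ###-##-#### or nine consecutive digits.
SSN_PATTERN = re.compile(r'\d{3}-\d{2}-\d{4}|\d{9}')


def is_valid_ssn(ssn):
    """Return True if ssn matches one of the two accepted layouts."""
    # fullmatch requires the entire string to match the pattern.
    return SSN_PATTERN.fullmatch(ssn) is not None


with_dashes = is_valid_ssn('333-33-3333')
without_dashes = is_valid_ssn('333333333')
misplaced_dash = is_valid_ssn('33-333-3333')
too_short = is_valid_ssn('333-33-333')
```

A validating version of the class could call a helper like this in __init__ and raise a ValueError when it returns False.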


All Classes Inherit Directly or Indirectly from Class object 


You use inheritance to create new classes from existing ones. In fact, every Python class 
inherits from an existing class. When you do not explicitly specify the base class for a new 
class, Python assumes that the class inherits directly from class object. The Python class 


hierarchy begins with class object, the direct or indirect base class of every class. So, class 





CommissionEmployee’s header could have been written as 


class CommissionEmployee(object):





The parentheses after CommissionEmployee indicate inheritance and may contain a single 
class for single inheritance or a comma-separated list of base classes for multiple inheritance. 


Once again, multiple inheritance is beyond the scope of this book. 





Class CommissionEmployee inherits all the methods of class object. Class object does
not have any data attributes. Two of the many methods inherited from object are
__repr__ and __str__. So every class has these methods that return string
representations of the objects on which they’re called. When a base-class method
implementation is inappropriate for a derived class, that method can be overridden (i.e.,
redefined) in the derived class with an appropriate implementation. Method __repr__
(lines 58-64) overrides the default implementation inherited into class
CommissionEmployee from class object.⁶


6. See https://docs.python.org/3/reference/datamodel.html for object’s
overridable methods.
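
To make the overriding described above concrete, here is a small sketch comparing the __repr__ inherited from object with an overridden one. The classes Plain and Point are hypothetical and are not part of the book's examples:

```python
class Plain:
    """Inherits __repr__ from object unchanged."""

class Point:
    """Overrides __repr__ with a custom implementation."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f'Point(x={self.x}, y={self.y})'

print(repr(Plain()))      # default form: class name and memory address
print(repr(Point(1, 2)))  # custom form defined by Point.__repr__
```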


Testing Class CommissionEmployee 





Let’s quickly test some of CommissionEmployee’s features. First, create and display a
CommissionEmployee:


In [1]: from commissionemployee import CommissionEmployee

In [2]: from decimal import Decimal

In [3]: c = CommissionEmployee('Sue', 'Jones', '333-33-3333',
   ...:     Decimal('10000.00'), Decimal('0.06'))
   ...:

In [4]: c
Out[4]:
CommissionEmployee: Sue Jones
social security number: 333-33-3333
gross sales: 10000.00
commission rate: 0.06





Next, let’s calculate and display the CommissionEmployee’s earnings: 


In [5]: print(f'{c.earnings():,.2f}')
600.00 





Finally, let’s change the CommissionEmployee’s gross sales and commission rate, then
recalculate the earnings:


In [6]: c.gross_sales = Decimal('20000.00')

In [7]: c.commission_rate = Decimal('0.1')

In [8]: print(f'{c.earnings():,.2f}')
2,000.00


10.8.2 Subclass SalariedCommissionEmployee 


With single inheritance, the subclass starts essentially the same as the base class. The real
strength of inheritance comes from the ability to define in the subclass additions,
replacements or refinements for the features inherited from the base class.


Many of a SalariedCommissionEmployee’s capabilities are similar, if not identical, to
those of class CommissionEmployee. Both types of employees have first name, last name,
Social Security number, gross sales and commission rate data attributes, and properties and
methods to manipulate that data. To create class SalariedCommissionEmployee without
using inheritance, we could have copied class CommissionEmployee’s code and pasted it
into class SalariedCommissionEmployee. Then we could have modified the new class to
include a base salary data attribute, and the properties and methods that manipulate the base
salary, including a new earnings method. This copy-and-paste approach is often error-prone.
Worse yet, it can spread many physical copies of the same code (including errors)
throughout a system, making your code less maintainable. Inheritance enables us to “absorb”
the features of a class without duplicating code. Let’s see how.


Declaring Class SalariedCommissionEmployee 


We now declare the subclass SalariedCommissionEmployee, which inherits most of its
capabilities from class CommissionEmployee (line 6). A
SalariedCommissionEmployee is a CommissionEmployee (because inheritance passes
on the capabilities of class CommissionEmployee), but class
SalariedCommissionEmployee also has the following features:


• Method __init__ (lines 10-15), which initializes all the data inherited from class
CommissionEmployee (we'll say more about this momentarily), then uses the
base_salary property’s setter to create a _base_salary data attribute.

• Read-write property base_salary (lines 17-27), in which the setter performs data
validation.

• A customized version of method earnings (lines 29-31).

• A customized version of method __repr__ (lines 33-36).


 1 # salariedcommissionemployee.py
 2 """SalariedCommissionEmployee derived from CommissionEmployee."""
 3 from commissionemployee import CommissionEmployee
 4 from decimal import Decimal
 5 
 6 class SalariedCommissionEmployee(CommissionEmployee):
 7     """An employee who gets paid a salary plus
 8     commission based on gross sales."""
 9 
10     def __init__(self, first_name, last_name, ssn,
11                  gross_sales, commission_rate, base_salary):
12         """Initialize SalariedCommissionEmployee's attributes."""
13         super().__init__(first_name, last_name, ssn,
14                          gross_sales, commission_rate)
15         self.base_salary = base_salary  # validate via property
16 
17     @property
18     def base_salary(self):
19         return self._base_salary
20 
21     @base_salary.setter
22     def base_salary(self, salary):
23         """Set base salary or raise ValueError if invalid."""
24         if salary < Decimal('0.00'):
25             raise ValueError('Base salary must be >= to 0')
26 
27         self._base_salary = salary
28 
29     def earnings(self):
30         """Calculate earnings."""
31         return super().earnings() + self.base_salary
32 
33     def __repr__(self):
34         """Return string representation for repr()."""
35         return ('Salaried' + super().__repr__() +
36             f'\nbase salary: {self.base_salary:.2f}')


Inheriting from Class CommissionEmployee


To inherit from a class, you must first import its definition (line 3). Line 6


class SalariedCommissionEmployee(CommissionEmployee):


specifies that class SalariedCommissionEmployee inherits from
CommissionEmployee. Though you do not see class CommissionEmployee’s data
attributes, properties and methods in class SalariedCommissionEmployee, they’re
nevertheless part of the new class, as you’ll soon see.


Method __init__ and Built-In Function super


Each subclass __init__ must explicitly call its base class’s __init__ to initialize the data
attributes inherited from the base class. This call should be the first statement in the
subclass’s __init__ method. SalariedCommissionEmployee’s __init__ method
explicitly calls class CommissionEmployee’s __init__ method (lines 13-14) to initialize
the base-class portion of a SalariedCommissionEmployee object (that is, the five
inherited data attributes from class CommissionEmployee). The notation
super().__init__ uses the built-in function super to locate and call the base class’s
__init__ method, passing the five arguments that initialize the inherited data attributes.
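
The role of the super().__init__ call can be seen in isolation with a smaller sketch. The classes Person, Student and Forgetful are hypothetical and exist only to show what happens when the base-class initializer is, or is not, called:

```python
class Person:
    def __init__(self, name):
        self.name = name

class Student(Person):
    def __init__(self, name, school):
        super().__init__(name)  # initialize the inherited attribute first
        self.school = school    # then the subclass's own attribute

class Forgetful(Person):
    def __init__(self, name, school):
        self.school = school    # base-class __init__ is never called

student = Student('Ana', 'State University')
print(student.name, student.school)

forgetful = Forgetful('Ben', 'State University')
print(hasattr(forgetful, 'name'))  # the inherited attribute was never created
```

Because Forgetful skips the base-class call, its objects lack the name attribute that Person's methods would expect.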


Overriding Method earnings 


Class SalariedCommissionEmployee’s earnings method (lines 29-31) overrides class
CommissionEmployee’s earnings method (Section 10.8.1, lines 54-56) to calculate the
earnings of a SalariedCommissionEmployee. The new version obtains the portion of the
earnings based on commission alone by calling CommissionEmployee’s earnings method
with the expression super().earnings() (line 31). SalariedCommissionEmployee’s
earnings method then adds the base_salary to this value to calculate the total earnings.
By having SalariedCommissionEmployee’s earnings method invoke
CommissionEmployee’s earnings method to calculate part of a
SalariedCommissionEmployee’s earnings, we avoid duplicating the code and reduce
code-maintenance problems.


Overriding Method __repr__


SalariedCommissionEmployee’s __repr__ method (lines 33-36) overrides class
CommissionEmployee’s __repr__ method (Section 10.8.1, lines 58-64) to return a
string representation that’s appropriate for a SalariedCommissionEmployee. The
subclass creates part of the string representation by concatenating 'Salaried' and the
string returned by super().__repr__(), which calls CommissionEmployee’s
__repr__ method. The overridden method then concatenates the base salary information
and returns the resulting string.


Testing Class SalariedCommissionEmployee 





Let’s test class SalariedCommissionEmployee to show that it indeed inherited
capabilities from class CommissionEmployee. First, let’s create a
SalariedCommissionEmployee and print all of its properties:


In [9]: from salariedcommissionemployee import SalariedCommissionEmployee

In [10]: s = SalariedCommissionEmployee('Bob', 'Lewis', '444-44-4444',
    ...:     Decimal('5000.00'), Decimal('0.04'), Decimal('300.00'))
    ...:

In [11]: print(s.first_name, s.last_name, s.ssn, s.gross_sales,
    ...:       s.commission_rate, s.base_salary)
Bob Lewis 444-44-4444 5000.00 0.04 300.00

















Notice that the SalariedCommissionEmployee object has all of the properties of classes
CommissionEmployee and SalariedCommissionEmployee.


Next, let’s calculate and display the SalariedCommissionEmployee’s earnings. Because
we call method earnings on a SalariedCommissionEmployee object, the subclass
version of the method executes:


In [12]: print(f'{s.earnings():,.2f}')
500.00


Now, let’s modify the gross_sales, commission_rate and base_salary properties,
then display the updated data via the SalariedCommissionEmployee’s __repr__
method:


In [13]: s.gross_sales = Decimal('10000.00')

In [14]: s.commission_rate = Decimal('0.05')

In [15]: s.base_salary = Decimal('1000.00')

In [16]: print(s)
SalariedCommissionEmployee: Bob Lewis
social security number: 444-44-4444
gross sales: 10000.00
commission rate: 0.05
base salary: 1000.00





Again, because this method is called on a SalariedCommissionEmployee object, the
subclass version of the method executes. Finally, let’s calculate and display the
SalariedCommissionEmployee’s updated earnings:


In [17]: print(f'{s.earnings():,.2f}')
1,500.00


Testing the “is a” Relationship 


Python provides two built-in functions, issubclass and isinstance, for testing “is a”
relationships. Function issubclass determines whether one class is derived from another:


In [18]: issubclass(SalariedCommissionEmployee, CommissionEmployee)
Out[18]: True


Function isinstance determines whether an object has an “is a” relationship with a specific
type. Because SalariedCommissionEmployee inherits from CommissionEmployee,
both of the following snippets return True, confirming the “is a” relationship:


In [19]: isinstance(s, CommissionEmployee)
Out[19]: True

In [20]: isinstance(s, SalariedCommissionEmployee)
Out[20]: True


10.8.3 Processing CommissionEmployees and
SalariedCommissionEmployees Polymorphically


With inheritance, every object of a subclass also may be treated as an object of that subclass’s
base class. We can take advantage of this “subclass-object-is-a-base-class-object” relationship
to perform some interesting manipulations. For example, we can place objects related
through inheritance into a list, then iterate through the list and treat each element as a
base-class object. This allows a variety of objects to be processed in a general way. Let’s
demonstrate this by placing the CommissionEmployee and
SalariedCommissionEmployee objects in a list, then for each element displaying its
string representation and earnings:


In [21]: employees = [c, s]

In [22]: for employee in employees:
    ...:     print(employee)
    ...:     print(f'{employee.earnings():,.2f}\n')
    ...:
CommissionEmployee: Sue Jones
social security number: 333-33-3333
gross sales: 20000.00
commission rate: 0.10
2,000.00

SalariedCommissionEmployee: Bob Lewis
social security number: 444-44-4444
gross sales: 10000.00
commission rate: 0.05
base salary: 1000.00
1,500.00


As you can see, the correct string representation and earnings are displayed for each 
employee. This is called polymorphism—a key capability of object-oriented programming 
(OOP). 
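
The same idea can be sketched outside the employee hierarchy. In the following illustration, the classes Shape, Square and Circle are hypothetical; the loop treats every element as a Shape, yet each object's own area method executes:

```python
import math

class Shape:
    """Base class; subclasses override area."""
    def area(self):
        return 0.0

    def __repr__(self):
        return f'{type(self).__name__} with area {self.area():.2f}'

class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2

shapes = [Square(2), Circle(1)]

for shape in shapes:  # each element is processed as a Shape
    print(shape)      # the inherited __repr__ calls the overridden area
```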


10.8.4 A Note About Object-Based and Object-Oriented Programming 


Inheritance with method overriding is a powerful way to build software components that are
like existing components but need to be customized to your application’s unique needs. In the
Python open-source world, there are a huge number of well-developed class libraries for
which your programming style is:


• know what libraries are available,

• know what classes are available,

• make objects of existing classes, and

• send them messages (that is, call their methods).


This style of programming is called object-based programming (OBP). When you do
composition with objects of known classes, you're still doing object-based programming.
Adding inheritance with overriding to customize methods to the unique needs of your
applications and possibly process objects polymorphically is called object-oriented
programming (OOP). If you do composition with objects of inherited classes, that’s also
object-oriented programming.


10.9 DUCK TYPING AND POLYMORPHISM 


Most other object-oriented programming languages require inheritance-based “is a” 
relationships to achieve polymorphic behavior. Python is more flexible. It uses a concept 
called duck typing, which the Python documentation describes as: 


A programming style which does not look at an object’s type to determine if it has the right
interface; instead, the method or attribute is simply called or used (“If it looks like a duck
and quacks like a duck, it must be a duck.”).⁷


7. https://docs.python.org/3/glossary.html#term-duck-typing.


So, when processing an object at execution time, its type does not matter. As long as the 
object has the data attribute, property or method (with the appropriate parameters) you wish 
to access, the code will work. 


Let’s reconsider the loop at the end of Section 10.8.3, which processes a list of employees:


for employee in employees:
    print(employee)
    print(f'{employee.earnings():,.2f}\n')


In Python, this loop works properly as long as employees contains only objects that:


• can be displayed with print (that is, they have a string representation) and

• have an earnings method which can be called with no arguments.


All classes inherit from object directly or indirectly, so they all inherit the default methods
for obtaining string representations that print can display. If a class has an earnings
method that can be called with no arguments, we can include objects of that class in the list
employees, even if the object’s class does not have an “is a” relationship with class
CommissionEmployee. To demonstrate this, consider class WellPaidDuck:


In [1]: class WellPaidDuck:
   ...:     def __repr__(self):
   ...:         return 'I am a well-paid duck'
   ...:     def earnings(self):
   ...:         return Decimal('1_000_000.00')
   ...:


WellPaidDuck objects, which clearly are not meant to be employees, will work with the
preceding loop. To prove this, let’s create objects of our classes CommissionEmployee,
SalariedCommissionEmployee and WellPaidDuck and place them in a list:


In [2]: from decimal import Decimal

In [3]: from commissionemployee import CommissionEmployee

In [4]: from salariedcommissionemployee import SalariedCommissionEmployee

In [5]: c = CommissionEmployee('Sue', 'Jones', '333-33-3333',
   ...:     Decimal('10000.00'), Decimal('0.06'))
   ...:

In [6]: s = SalariedCommissionEmployee('Bob', 'Lewis', '444-44-4444',
   ...:     Decimal('5000.00'), Decimal('0.04'), Decimal('300.00'))
   ...:

In [7]: d = WellPaidDuck()

In [8]: employees = [c, s, d]





Now, let’s process the list using the loop from Section 10.8.3. As you can see in the output,
Python is able to use duck typing to polymorphically process all three objects in the list:


In [9]: for employee in employees:
   ...:     print(employee)
   ...:     print(f'{employee.earnings():,.2f}\n')
   ...:
CommissionEmployee: Sue Jones
social security number: 333-33-3333
gross sales: 10000.00
commission rate: 0.06
600.00

SalariedCommissionEmployee: Bob Lewis
social security number: 444-44-4444
gross sales: 5000.00
commission rate: 0.04
base salary: 300.00
500.00

I am a well-paid duck
1,000,000.00
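
Duck typing also means that an object lacking the expected method fails only when the method is actually called. The following sketch, using the hypothetical classes Contractor and Rock and the hypothetical function total_earnings, shows one way such a case might be handled:

```python
from decimal import Decimal

class Contractor:
    """Unrelated to any employee class, but has an earnings method."""
    def earnings(self):
        return Decimal('250.00')

class Rock:
    """Has no earnings method."""

def total_earnings(items):
    """Sum earnings, skipping objects that lack an earnings method."""
    total = Decimal('0.00')

    for item in items:
        try:
            total += item.earnings()
        except AttributeError:
            print(f'skipping {type(item).__name__}: no earnings method')

    return total

print(total_earnings([Contractor(), Rock(), Contractor()]))
```

Note that catching AttributeError this broadly would also hide an AttributeError raised inside an earnings method, so this is a simplification for illustration.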


10.10 OPERATOR OVERLOADING 


You’ve seen that you can interact with objects by accessing their attributes and properties and
by calling their methods. Method-call notation can be cumbersome for certain kinds of
operations, such as arithmetic. In these cases, it would be more convenient to use Python’s
rich set of built-in operators.


This section shows how to use operator overloading to define how Python’s operators
should handle objects of your own types. You’ve already used operator overloading frequently
across wide ranges of types. For example, you've used:


• the + operator for adding numeric values, concatenating lists, concatenating strings and
adding a value to every element in a NumPy array.

• the [] operator for accessing elements in lists, tuples, strings and arrays and for accessing
the value for a specific key in a dictionary.

• the * operator for multiplying numeric values, repeating a sequence and multiplying every
element in a NumPy array by a specific value.


You can overload most operators. Each overloadable operator has a corresponding
special method, such as __add__ for the addition (+) operator or __mul__ for the
multiplication (*) operator. Defining these methods in your class enables you to specify how
a given operator works for objects of that class. For a complete list of special methods, see


https://docs.python.org/3/reference/datamodel.html#special-method-names


Operator Overloading Restrictions 


There are some restrictions on operator overloading: 


• The precedence of an operator cannot be changed by overloading. However, parentheses
can be used to force evaluation order in an expression.

• The left-to-right or right-to-left grouping of an operator cannot be changed by
overloading.

• The “arity” of an operator (that is, whether it’s a unary or binary operator) cannot be
changed.

• You cannot create new operators; only existing operators can be overloaded.

• The meaning of how an operator works on objects of built-in types cannot be changed.
You cannot, for example, change + so that it subtracts two integers.

• Operator overloading works only with objects of custom classes or with a mixture of an
object of a custom class and an object of a built-in type.
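
The last restriction can be illustrated with a sketch that mixes a custom class with built-in numbers. The class Meters is hypothetical. It defines __add__ for the case in which the custom object is the left operand and the reflected special method __radd__ for the case in which a built-in number is the left operand:

```python
class Meters:
    """Hypothetical length type used to demonstrate mixed-type addition."""
    def __init__(self, value):
        self.value = value

    def __add__(self, right):
        """Handle Meters + Meters and Meters + number."""
        if isinstance(right, Meters):
            return Meters(self.value + right.value)
        if isinstance(right, (int, float)):
            return Meters(self.value + right)
        return NotImplemented  # let Python report an unsupported operand

    def __radd__(self, left):
        """Handle number + Meters (the reflected operand order)."""
        return self.__add__(left)

    def __repr__(self):
        return f'Meters({self.value})'

print(Meters(2) + Meters(3))  # custom + custom
print(Meters(2) + 3)          # custom + built-in
print(3 + Meters(2))          # built-in + custom, handled by __radd__
print(1 + 2)                  # built-in + built-in is unaffected
```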


Complex Numbers 


To demonstrate operator overloading, we'll define a class named Complex that represents
complex numbers.⁸ Complex numbers, like -3 + 4i and 6.2 - 11.73i, have the form


realPart + imaginaryPart * i


where i is √-1. Like ints, floats and Decimals, complex numbers are arithmetic types.
In this section, we'll create a class Complex that overloads just the + addition operator and
the += augmented assignment, so we can add Complex objects using Python’s mathematical
notations.


8. Python has built-in support for complex values, so this class is simply for demonstration
purposes.


10.10.1 Test-Driving Class Complex 


First, let’s use class Complex to demonstrate its capabilities. We’ll discuss the class’s details
in the next section. Import class Complex from complexnumber.py:


In [1]: from complexnumber import Complex


Next, create and display a couple of Complex objects. Snippets [3] and [5] implicitly call
the Complex class’s __repr__ method to get a string representation of each object:


In [2]: x = Complex(real=2, imaginary=4)

In [3]: x
Out[3]: (2 + 4i)

In [4]: y = Complex(real=5, imaginary=-1)

In [5]: y
Out[5]: (5 - 1i)


We chose the __repr__ string format shown in snippets [3] and [5] to mimic the
__repr__ strings produced by Python’s built-in complex type.⁹


9. Python uses j rather than i for √-1. For example, 3+4j (with no spaces around the operator)
creates a complex object with real and imag attributes. The __repr__ string for this
complex value is '(3+4j)'.


Now, let’s use the + operator to add the Complex objects x and y. This expression adds the
real parts of the two operands (2 and 5) and the imaginary parts of the two operands (4i and
-1i), then returns a new Complex object containing the result:


In [6]: x + y
Out[6]: (7 + 3i)

In [7]: x
Out[7]: (2 + 4i)

In [8]: y
Out[8]: (5 - 1i)


Finally, let’s use the += operator to add y to x and store the result in x. The += operator
modifies its left operand but not its right operand:


In [9]: x += y

In [10]: x
Out[10]: (7 + 3i)

In [11]: y
Out[11]: (5 - 1i)





10.10.2 Class Complex Definition 


Now that we've seen class Complex in action, let’s look at its definition to see how those
capabilities were implemented.


Method __init__


The class’s __init__ method receives parameters to initialize the real and imaginary
data attributes:


 1 # complexnumber.py
 2 """Complex class with overloaded operators."""
 3 
 4 class Complex:
 5     """Complex class that represents a complex number
 6     with real and imaginary parts."""
 7 
 8     def __init__(self, real, imaginary):
 9         """Initialize Complex class's attributes."""
10         self.real = real
11         self.imaginary = imaginary
12 


Overloaded + Operator 


The following overridden special method __add__ defines how to overload the + operator
for use with two Complex objects:


13     def __add__(self, right):
14         """Overrides the + operator."""
15         return Complex(self.real + right.real,
16                        self.imaginary + right.imaginary)
17 


Methods that overload binary operators must provide two parameters: the first (self) is the
left operand and the second (right) is the right operand. Class Complex’s __add__ method
takes two Complex objects as arguments and returns a new Complex object containing the
sum of the operands’ real parts and the sum of the operands’ imaginary parts.


We do not modify the contents of either of the original operands. This matches our intuitive
sense of how this operator should behave. Adding two numbers does not modify either of the
original values.


Overloaded += Augmented Assignment 


Lines 18-22 overload special method __iadd__ to define how the += operator adds two
Complex objects:


18     def __iadd__(self, right):
19         """Overrides the += operator."""
20         self.real += right.real
21         self.imaginary += right.imaginary
22         return self
23 


Augmented assignments modify their left operands, so method __iadd__ modifies the self
object, which represents the left operand, then returns self.
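
The practical effect of defining __iadd__ can be observed by checking object identity. In this sketch, the hypothetical class Tally defines both __add__ and __iadd__, while the hypothetical class AddOnlyTally defines only __add__, in which case Python falls back to __add__ for += and binds the variable to a new object:

```python
class Tally:
    """Defines both + and +=."""
    def __init__(self, n):
        self.n = n

    def __add__(self, right):
        return Tally(self.n + right.n)

    def __iadd__(self, right):
        self.n += right.n  # modify the left operand in place
        return self

class AddOnlyTally:
    """Defines only +, so += falls back to __add__."""
    def __init__(self, n):
        self.n = n

    def __add__(self, right):
        return AddOnlyTally(self.n + right.n)

a = Tally(1)
a_alias = a
a += Tally(2)
print(a is a_alias, a_alias.n)       # same object, updated in place

b = AddOnlyTally(1)
b_alias = b
b += AddOnlyTally(2)
print(b is b_alias, b.n, b_alias.n)  # b now refers to a new object
```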


Method __repr__ 


Lines 24-28 return the string representation of a Complex number.


24     def __repr__(self):
25         """Return string representation for repr()."""
26         return (f'({self.real} ' +
27                 ('+' if self.imaginary >= 0 else '-') +
28                 f' {abs(self.imaginary)}i)')


10.11 EXCEPTION CLASS HIERARCHY AND CUSTOM 
EXCEPTIONS 


In the previous chapter, we introduced exception handling. Every exception is an object of a
class in Python’s exception class hierarchy¹⁰ or an object of a class that inherits from one of
those classes. Exception classes inherit directly or indirectly from base class
BaseException, and the built-in exception classes are defined in module builtins.


10. https://docs.python.org/3/library/exceptions.html.


Python defines four primary BaseException subclasses: SystemExit,
KeyboardInterrupt, GeneratorExit and Exception:


• SystemExit terminates program execution (or terminates an interactive session) and
when uncaught does not produce a traceback like other exception types.

• KeyboardInterrupt exceptions occur when the user types the interrupt command,
Ctrl + C (or control + C) on most systems.

• GeneratorExit exceptions occur when a generator closes, normally when a generator
finishes producing values or when its close method is called explicitly.

• Exception is the base class for most common exceptions you'll encounter. You’ve seen
exceptions of the Exception subclasses ZeroDivisionError, NameError,
ValueError, StatisticsError, TypeError, IndexError, KeyError, RuntimeError
and AttributeError. Often, exceptions of these types can be caught and handled, so
the program can continue running.


Catching Base-Class Exceptions 


One of the benefits of the exception class hierarchy is that an except handler can catch
exceptions of a particular type or can use a base-class type to catch those base-class
exceptions and all related subclass exceptions. For example, an except handler that
specifies the base class Exception can catch objects of any subclass of Exception. Placing
an except handler that catches type Exception before other except handlers is a logic
error, because all exceptions would be caught before other exception handlers could be
reached. Thus, subsequent exception handlers are unreachable.
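
The ordering rule can be sketched as follows. The function classify is hypothetical; it lists handlers from the most specific class to the most general, so each exception reaches the first handler whose type matches:

```python
import math

def classify(action):
    """Call action and report which handler caught its exception."""
    try:
        action()
    except ZeroDivisionError:  # most specific handler first
        return 'division by zero'
    except ArithmeticError:    # base class of ZeroDivisionError and OverflowError
        return 'other arithmetic error'
    except Exception:          # most general handler last
        return 'some other exception'

    return 'no exception'

print(classify(lambda: 1 / 0))           # ZeroDivisionError
print(classify(lambda: math.exp(1000)))  # OverflowError
print(classify(lambda: int('x')))        # ValueError
print(classify(lambda: None))            # nothing raised
```

If the except Exception clause were moved to the top, it would catch all three exceptions and the two handlers after it would never execute.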


Custom Exception Classes 


When you raise an exception from your code, you should generally use one of the existing
exception classes from the Python Standard Library. However, using the inheritance
techniques presented earlier in this chapter, you can create your own custom exception
classes that derive directly or indirectly from class Exception. Generally, that’s discouraged,
especially among novice programmers. Before creating custom exception classes, look for an
appropriate existing exception class in the Python exception hierarchy. Define new exception
classes only if you need to catch and handle the exceptions differently from other existing
exception types. That should be rare.
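
For the rare case in which a custom exception class is warranted, a minimal sketch might look like the following. The class InsufficientFundsError and the function withdraw are hypothetical names chosen for illustration:

```python
class InsufficientFundsError(Exception):
    """Hypothetical custom exception that carries extra context."""
    def __init__(self, balance, amount):
        super().__init__(f'cannot withdraw {amount}; balance is {balance}')
        self.balance = balance
        self.amount = amount

def withdraw(balance, amount):
    """Return the new balance or raise InsufficientFundsError."""
    if amount > balance:
        raise InsufficientFundsError(balance, amount)

    return balance - amount

try:
    withdraw(100, 250)
except InsufficientFundsError as error:  # handled separately from other errors
    print(error)
    print(error.balance, error.amount)
```

Because the class derives from Exception, an except Exception handler would also catch it.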


10.12 NAMED TUPLES 


You've used tuples to aggregate several data attributes into a single object. The Python
Standard Library’s collections module also provides named tuples that enable you to
reference a tuple’s members by name rather than by index number.


Let’s create a simple named tuple that might be used to represent a card in a deck of cards.
First, import function namedtuple:


In [1]: from collections import namedtuple


Function namedtuple creates a subclass of the built-in tuple type. The function’s first
argument is your new type’s name and the second is a list of strings representing the
identifiers you'll use to reference the new type’s members:


In [2]: Card = namedtuple('Card', ['face', 'suit'])


We now have a new tuple type named Card that we can use anywhere a tuple can be used.
Let’s create a Card object, access its members by name and display its string representation:


In [3]: card = Card(face='Ace', suit='Spades')

In [4]: card.face
Out[4]: 'Ace'

In [5]: card.suit
Out[5]: 'Spades'

In [6]: card
Out[6]: Card(face='Ace', suit='Spades')


Other Named Tuple Features 


Each named tuple type has additional methods. The type’s _make class method (that is, a
method called on the class) receives an iterable of values and returns an object of the named
tuple type:


In [7]: values = ['Queen', 'Hearts']

In [8]: card = Card._make(values)

In [9]: card
Out[9]: Card(face='Queen', suit='Hearts')





This could be useful, for example, if you have a named tuple type representing records in a
CSV file. As you read and tokenize CSV records, you could convert them into named tuple
objects.
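
The CSV scenario just described might be sketched as follows. To keep the example self-contained, it reads from an in-memory string via io.StringIO rather than from a file, and the sample data is made up:

```python
import csv
import io
from collections import namedtuple

Card = namedtuple('Card', ['face', 'suit'])

# In-memory stand-in for a CSV file with one card per record.
csv_text = 'Ace,Spades\nQueen,Hearts\n'

# csv.reader yields each record as a list of strings, which _make accepts.
cards = [Card._make(row) for row in csv.reader(io.StringIO(csv_text))]

for card in cards:
    print(card.face, 'of', card.suit)
```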


For a given object of a named tuple type, you can get an OrderedDict dictionary
representation of the object’s member names and values. An OrderedDict remembers the
order in which its key-value pairs were inserted in the dictionary:


In [10]: card._asdict()
Out[10]: OrderedDict([('face', 'Queen'), ('suit', 'Hearts')])


For additional named tuple features see:


https://docs.python.org/3/library/collections.html#collections.namedtuple


10.13 A BRIEF INTRO TO PYTHON 3.7’S NEW DATA 
CLASSES 


Though named tuples allow you to reference their members by name, they’re still just tuples,
not classes. For some of the benefits of named tuples, plus the capabilities that traditional
Python classes provide, you can use Python 3.7’s new data classes¹¹ from the Python
Standard Library’s dataclasses module.


11. https://www.python.org/dev/peps/pep-0557/.


Data classes are among Python 3.7’s most important new features. They help you build
classes faster by using more concise notation and by autogenerating “boilerplate” code that’s
common in most classes. They could become the preferred way to define many Python
classes. In this section, we'll present data-class fundamentals. At the end of the section, we'll
provide links to more information.


Data Classes Autogenerate Code 


Most classes you'll define provide an __init__ method to create and initialize an object’s
attributes and a __repr__ method to specify an object’s custom string representation. If a
class has many data attributes, creating these methods can be tedious.


Data classes autogenerate the data attributes and the __init__ and __repr__ methods for
you. This can be particularly useful for classes that primarily aggregate related data items.
For example, in an application that processes CSV records, you might want a class that
represents each record’s fields as data attributes in an object. Data classes also can be
generated dynamically from a list of field names.


Data classes also autogenerate method __eq__, which overloads the == operator. Any class
that has an __eq__ method also implicitly supports !=. All classes inherit class object’s
default __ne__ (not equals) method implementation, which returns the opposite of __eq__
(or NotImplemented if the class does not define __eq__). Data classes do not automatically
generate methods for the <, <=, > and >= comparison operators, but they can.
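
As a brief sketch of the autogenerated methods just described, consider the hypothetical data class Point below. It defines no methods of its own, yet its objects can be created, displayed and compared for equality. Ordering comparisons are not generated by default:

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int

p = Point(1, 2)          # autogenerated __init__
print(p)                 # autogenerated __repr__
print(p == Point(1, 2))  # autogenerated __eq__
print(p != Point(3, 4))  # != is supported through __eq__

try:
    p < Point(3, 4)      # no __lt__ was generated
except TypeError as error:
    print('TypeError:', error)
```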


10.13.1 Creating a Card Data Class 


Let’s reimplement class Card from Section 10.6.2 as a data class. The new class is defined in
carddataclass.py. As you’ll see, defining a data class requires some new syntax. In the
subsequent subsections, we'll use our new Card data class in class DeckOfCards to show
that it’s interchangeable with the original Card class, then discuss some of the benefits of
data classes over named tuples and traditional Python classes.


Importing from the dataclasses and typing Modules


The Python Standard Library’s dataclasses module defines decorators and functions for
implementing data classes. We'll use the @dataclass decorator (imported at line 4) to
specify that a new class is a data class and causes various code to be written for you. Recall
that our original Card class defined class variables FACES and SUITS, which are lists of the
strings used to initialize Cards. We use ClassVar and List from the Python Standard
Library’s typing module (imported at line 5) to indicate that FACES and SUITS are class
variables that refer to lists. We'll say more about these momentarily:


1 # carddataclass.py
2 """Card data class with class attributes, data attributes,
3 autogenerated methods and explicitly defined methods."""
4 from dataclasses import dataclass
5 from typing import ClassVar, List
6 

6 


Using the @dataclass Decorator 


To specify that a class is a data class, precede its definition with the @dataclass
decorator:¹²


7 @dataclass
8 class Card:


12. https://docs.python.org/3/library/dataclasses.html#module-level-decorators-classes-and-functions.


Optionally, the @dataclass decorator may specify parentheses containing arguments that
help the data class determine what autogenerated methods to include. For example, the
decorator @dataclass(order=True) would cause the data class to autogenerate
overloaded comparison operator methods for <, <=, > and >=. This might be useful, for
example, if you need to sort your data-class objects.
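As a minimal sketch of the order=True argument, assume a hypothetical Version data class (not part of carddataclass.py). The autogenerated comparison methods compare objects as if they were tuples of their data attributes, in declaration order, which is what lets sorted work:

```python
from dataclasses import dataclass

@dataclass(order=True)
class Version:
    """Hypothetical data class used only for this illustration."""
    major: int
    minor: int

versions = [Version(3, 7), Version(2, 7), Version(3, 6)]
ordered = sorted(versions)  # relies on the autogenerated __lt__ method
print(ordered)
```

Without order=True, the call to sorted would raise a TypeError because the < operator would not be defined for Version objects.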


Variable Annotations: Class Attributes 


Unlike regular classes, data classes declare both class attributes and data attributes inside the
class, but outside the class’s methods. In a regular class, only class attributes are declared
this way, and data attributes typically are created in __init__. Data classes require
additional information, or hints, to distinguish class attributes from data attributes, which
also affects the autogenerated methods’ implementation details.


Lines 9-11 define and initialize the class attributes FACES and SUITS:

9     FACES: ClassVar[List[str]] = ['Ace', '2', '3', '4', '5', '6', '7',
10         '8', '9', '10', 'Jack', 'Queen', 'King']
11     SUITS: ClassVar[List[str]] = ['Hearts', 'Diamonds', 'Clubs', 'Spades']
12
In lines 9 and 11, the notation

: ClassVar[List[str]]

is a variable annotation³,⁴ (sometimes called a type hint) specifying that FACES is a class
attribute (ClassVar) which refers to a list of strings (List[str]). SUITS also is a class
attribute which refers to a list of strings.





3. https://www.python.org/dev/peps/pep-0526/.

4. Variable annotations are a recent language feature and are optional for regular classes. You
will not see them in most legacy Python code.


Class variables are initialized in their definitions and are specific to the class, not individual
objects of the class. Methods __init__, __repr__ and __eq__, however, are for use with
objects of the class. When a data class generates these methods, it inspects all the variable
annotations and includes only the data attributes in the method implementations.
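One way to see which attributes a data class treats as data attributes is the dataclasses module’s fields function. The following is a sketch using a pared-down Card (an assumption for illustration; it omits FACES and the methods of the full class):

```python
from dataclasses import dataclass, fields
from typing import ClassVar, List

@dataclass
class Card:
    """Pared-down Card for illustration only."""
    SUITS: ClassVar[List[str]] = ['Hearts', 'Diamonds', 'Clubs', 'Spades']
    face: str
    suit: str

field_names = [field.name for field in fields(Card)]
print(field_names)              # the ClassVar-annotated SUITS is not listed
print(Card('Ace', Card.SUITS[3]))  # autogenerated __repr__ shows only face and suit
```

SUITS remains accessible as Card.SUITS, but it plays no part in the autogenerated __init__, __repr__ or __eq__.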


Variable Annotations: Data Attributes 


Normally, we create an object’s data attributes in the class’s __init__ method (or methods
called by __init__) via assignments of the form self.attribute_name = value. Because a
data class autogenerates its __init__ method, we need another way to specify data
attributes in a data class’s definition. We cannot simply place their names inside the class,
which generates a NameError, as in:


In [1]: from dataclasses import dataclass

In [2]: @dataclass
   ...: class Demo:
   ...:     x  # attempting to create a data attribute x
   ...:
NameError                                 Traceback (most recent call last)
<ipython-input-2-79ffe37b1ba2> in <module>()
----> 1 @dataclass
      2 class Demo:
      3     x  # attempting to create a data attribute x
      4

<ipython-input-2-79ffe37b1ba2> in Demo()
      1 @dataclass
      2 class Demo:
----> 3     x  # attempting to create a data attribute x
      4

NameError: name 'x' is not defined


Like class attributes, each data attribute must be declared with a variable annotation. Lines
13-14 define the data attributes face and suit. The variable annotation ": str" indicates
that each should refer to string objects:

13     face: str
14     suit: str


Defining a Property and Other Methods 


Data classes are classes, so they may contain properties and methods and participate in class
hierarchies. For this Card data class, we defined the same read-only image_name property
and custom special methods __str__ and __format__ as in our original Card class earlier
in the chapter:


15     @property
16     def image_name(self):
17         """Return the Card's image file name."""
18         return str(self).replace(' ', '_') + '.png'
19
20     def __str__(self):
21         """Return string representation for str()."""
22         return f'{self.face} of {self.suit}'
23
24     def __format__(self, format):
25         """Return formatted string representation."""
26         return f'{str(self):{format}}'


Variable Annotation Notes 


You can specify variable annotations using built-in type names (like str, int and float),
class types or types defined by the typing module (such as ClassVar and List shown
earlier). Even with type annotations, Python is still a dynamically typed language, so type
annotations are not enforced at execution time. Even though a Card’s face is meant to be
a string, you can assign any type of object to face.
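The following sketch illustrates that annotations are recorded but not checked at execution time. It assumes a pared-down Card data class with only the two data attributes:

```python
from dataclasses import dataclass

@dataclass
class Card:
    """Pared-down Card for illustration only."""
    face: str
    suit: str

card = Card(12, 'Hearts')  # an int where the annotation says str; no error
card.face = 3.5            # reassigning with a float is accepted too
print(type(card.face).__name__)
print(Card.__annotations__)  # the hints are stored, not enforced
```

A static analysis tool, rather than the interpreter, is what would flag these assignments.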


10.13.2 Using the Card Data Class 


Let’s demonstrate the new Card data class. First, create a Card: 


In [1]: from carddataclass import Card

In [2]: c1 = Card(Card.FACES[0], Card.SUITS[3])


Next, let’s use Card’s autogenerated __repr__ method to display the Card:

In [3]: c1
Out[3]: Card(face='Ace', suit='Spades')


Our custom __str__ method, which print calls when passing it a Card object, returns a
string of the form 'face of suit':

In [4]: print(c1)
Ace of Spades


Let’s access our data class’s attributes and read-only property: 


In [5]: c1.face
Out[5]: 'Ace'

In [6]: c1.suit
Out[6]: 'Spades'

In [7]: c1.image_name
Out[7]: 'Ace_of_Spades.png'


Next, let’s demonstrate that Card objects can be compared via the autogenerated ==
operator and inherited != operator. First, create two additional Card objects, one identical
to the first and one different:


In [8]: c2 = Card(Card.FACES[0], Card.SUITS[3])

In [9]: c2
Out[9]: Card(face='Ace', suit='Spades')

In [10]: c3 = Card(Card.FACES[0], Card.SUITS[0])

In [11]: c3
Out[11]: Card(face='Ace', suit='Hearts')





Now, compare the objects using == and !=:
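A minimal, self-contained sketch of such comparisons follows. It assumes a pared-down Card data class (only face and suit) in place of the one in carddataclass.py, and builds the same three cards by value:

```python
from dataclasses import dataclass

@dataclass
class Card:
    """Pared-down stand-in for the Card data class (assumption)."""
    face: str
    suit: str

c1 = Card('Ace', 'Spades')
c2 = Card('Ace', 'Spades')
c3 = Card('Ace', 'Hearts')

print(c1 == c2)  # same face and suit
print(c1 == c3)  # suits differ
print(c1 != c3)  # != negates the result of the autogenerated __eq__
```

Two Card objects are equal when their data attributes are equal, even though c1 and c2 are distinct objects.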











Our Card data class is interchangeable with the Card class developed earlier in this chapter.
To demonstrate this, we created the deck2.py file containing a copy of class DeckOfCards
from earlier in the chapter and imported the Card data class into the file. The following
snippets import class DeckOfCards, create an object of the class and print it. Recall that
print implicitly calls the DeckOfCards __str__ method, which formats each Card in a
field of 19 characters, resulting in a call to each Card’s __format__ method. Read each row
left-to-right to confirm that all the Cards are displayed in order from each suit (Hearts,
Diamonds, Clubs and Spades):
In [15]: from deck2 import DeckOfCards  # uses Card data class

In [16]: deck_of_cards = DeckOfCards()

In [17]: print(deck_of_cards)


Ace of Hearts      2 of Hearts        3 of Hearts        4 of Hearts
5 of Hearts        6 of Hearts        7 of Hearts        8 of Hearts
9 of Hearts        10 of Hearts       Jack of Hearts     Queen of Hearts
King of Hearts     Ace of Diamonds    2 of Diamonds      3 of Diamonds
4 of Diamonds      5 of Diamonds      6 of Diamonds      7 of Diamonds
8 of Diamonds      9 of Diamonds      10 of Diamonds     Jack of Diamonds
Queen of Diamonds  King of Diamonds   Ace of Clubs       2 of Clubs
3 of Clubs         4 of Clubs         5 of Clubs         6 of Clubs
7 of Clubs         8 of Clubs         9 of Clubs         10 of Clubs
Jack of Clubs      Queen of Clubs     King of Clubs      Ace of Spades
2 of Spades        3 of Spades        4 of Spades        5 of Spades
6 of Spades        7 of Spades        8 of Spades        9 of Spades
10 of Spades       Jack of Spades     Queen of Spades    King of Spades











10.13.3 Data Class Advantages over Named Tuples 


Data classes offer several advantages over named tuples:⁵

5. https://www.python.org/dev/peps/pep-0526/.

• Although each named tuple technically represents a different type, a named tuple is a
tuple and all tuples can be compared to one another. So, objects of different named tuple
types could compare as equal if they have the same number of members and the same
values for those members. Comparing objects of different data classes for equality always
returns False, as does comparing a data class object to a tuple object.

• If you have code that unpacks a tuple, adding more members to that tuple breaks the
unpacking code. Data class objects cannot be unpacked. So you can add more data
attributes to a data class without breaking existing code.

• A data class can be a base class or a subclass in an inheritance hierarchy.
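The first point can be sketched as follows. The type names Point, Size, PointData and SizeData are hypothetical and chosen only to show two unrelated types with the same shape:

```python
from collections import namedtuple
from dataclasses import dataclass

# Two unrelated named tuple types with the same number of members
Point = namedtuple('Point', ['x', 'y'])
Size = namedtuple('Size', ['width', 'height'])

@dataclass
class PointData:
    x: int
    y: int

@dataclass
class SizeData:
    width: int
    height: int

print(Point(1, 2) == Size(1, 2))          # both are tuples with equal members
print(PointData(1, 2) == SizeData(1, 2))  # objects of different data classes
print(PointData(1, 2) == (1, 2))          # data class object vs. plain tuple
```

The named tuples compare as equal even though they represent different concepts, whereas the data class comparisons do not.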


10.13.4 Data Class Advantages over Traditional Classes 


Data classes also offer various advantages over the traditional Python classes you saw earlier
in this chapter:

• A data class autogenerates __init__, __repr__ and __eq__, saving you time.

• A data class can autogenerate the special methods that overload the <, <=, > and >=
comparison operators.

• When you change data attributes defined in a data class, then use it in a script or
interactive session, the autogenerated code updates automatically. So, you have less code
to maintain and debug.

• The required variable annotations for class attributes and data attributes enable you to
take advantage of static code analysis tools. So, you might be able to eliminate additional
errors before they can occur at execution time.

• Some static code analysis tools and IDEs can inspect variable annotations and issue
warnings if your code uses the wrong type. This can help you locate logic errors in your
code before you execute it.


More Information 


Data classes have additional capabilities, such as creating “frozen” instances which do not
allow you to assign values to a data class object’s attributes after the object is created. For a
complete list of data class benefits and capabilities, see

https://www.python.org/dev/peps/pep-0557/

and

https://docs.python.org/3/library/dataclasses.html
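As a brief sketch of a frozen instance, the following assumes a pared-down Card data class declared with frozen=True. Assigning to an attribute after the object is created raises the dataclasses module’s FrozenInstanceError:

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class Card:
    """Pared-down Card for illustration only."""
    face: str
    suit: str

card = Card('Ace', 'Spades')

try:
    card.face = 'King'  # assignment after creation is rejected
except FrozenInstanceError as error:
    print(f'{type(error).__name__}: {error}')

print(card)  # the original values are unchanged
```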


10.14 UNIT TESTING WITH DOCSTRINGS AND DOCTEST 


A key aspect of software development is testing your code to ensure that it works correctly.
Even with extensive testing, however, your code may still contain bugs. According to the
famous Dutch computer scientist Edsger Dijkstra, “Testing shows the presence, not the
absence of bugs.”⁶

6. J. N. Buxton and B. Randell, eds, Software Engineering Techniques, April 1970, p. 16.
Report on a conference sponsored by the NATO Science Committee, Rome, Italy, 27-31
October 1969.


Module doctest and the testmod Function 


The Python Standard Library provides the doctest module to help you test your code and
conveniently retest it after you make modifications. When you execute the doctest
module’s testmod function, it inspects your functions’, methods’ and classes’ docstrings
looking for sample Python statements preceded by >>>, each followed on the next line by the
given statement’s expected output (if any).⁷ The testmod function then executes those
statements and confirms that they produce the expected output. If they do not, testmod
reports errors indicating which tests failed so you can locate and fix the problems in your
code. Each test you define in a docstring typically tests a specific unit of code, such as a
function, a method or a class. Such tests are called unit tests.

7. The notation >>> mimics the standard Python interpreter’s input prompts.
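To make the idea concrete, here is a small sketch with a hypothetical square function whose docstring contains two tests. Rather than testmod, it uses the doctest module’s lower-level DocTestFinder and DocTestRunner classes so that only this one function’s docstring is examined; that choice is ours for illustration:

```python
import doctest

def square(number):
    """Return number squared.

    >>> square(3)
    9
    >>> square(-2)
    4
    """
    return number ** 2

# Find the examples in square's docstring and execute them
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(square, 'square', globs={'square': square}):
    runner.run(test)

results = runner.summarize(verbose=False)
print(results.attempted, results.failed)
```

In a script, calling doctest.testmod() would find and run the same tests along with any others in the module’s docstrings.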


Modified Account Class 


The file accountdoctest.py contains the class Account from this chapter’s first example.
We modified the __init__ method’s docstring to include four tests which can be used to
ensure that the method works correctly:

• The test in line 11 creates a sample Account object named account1. This statement
does not produce any output.

• The test in line 12 shows what the value of account1’s name attribute should be if line 11
executed successfully. The sample output is shown in line 13.

• The test in line 14 shows what the value of account1’s balance attribute should be if
line 11 executed successfully. The sample output is shown in line 15.

• The test in line 18 creates an Account object with an invalid initial balance. The sample
output shows that a ValueError exception should occur in this case. For exceptions, the
doctest module’s documentation recommends showing just the first and last lines of the
traceback.⁸

8. https://docs.python.org/3/library/doctest.html?highlight=doctest#module-doctest.

You can intersperse your tests with descriptive text, such as line 17.


1 # accountdoctest.py
2 """Account class definition."""
3 from decimal import Decimal
4
5 class Account:
6     """Account class for demonstrating doctest."""
7
8     def __init__(self, name, balance):
9         """Initialize an Account object.
10
11         >>> account1 = Account('John Green', Decimal('50.00'))
12         >>> account1.name
13         'John Green'
14         >>> account1.balance
15         Decimal('50.00')
16
17         The balance argument must be greater than or equal to 0.
18         >>> account2 = Account('John Green', Decimal('-50.00'))
19         Traceback (most recent call last):
20             ...
21         ValueError: Initial balance must be >= to 0.00.
22         """
23
24         # if balance is less than 0.00, raise an exception
25         if balance < Decimal('0.00'):
26             raise ValueError('Initial balance must be >= to 0.00.')
27
28         self.name = name
29         self.balance = balance
30
31     def deposit(self, amount):
32         """Deposit money to the account."""
33
34         # if amount is less than 0.00, raise an exception
35         if amount < Decimal('0.00'):
36             raise ValueError('amount must be positive.')
37
38         self.balance += amount
39
40 if __name__ == '__main__':
41     import doctest
42     doctest.testmod(verbose=True)


Module __main__


When you load any module, Python assigns a string containing the module’s name to a global
attribute of the module called __name__. When you execute a Python source file (such as
accountdoctest.py) as a script, Python uses the string '__main__' as the module’s
name. You can use __name__ in an if statement like lines 40-42 to specify code that should
execute only if the source file is executed as a script. In this example, line 41 imports the
doctest module and line 42 calls the module’s testmod function to execute the docstring
unit tests.


Running Tests 


Run the file accountdoctest.py as a script to execute the tests. By default, if you call 
testmod with no arguments, it does not show test results for successful tests. In that case, if 
you get no output, all the tests executed successfully. In this example, line 42 calls testmod 
with the keyword argument verbose=True. This tells testmod to produce verbose output 


showing every test’s results: 


Trying:
    account1 = Account('John Green', Decimal('50.00'))
Expecting nothing
ok
Trying:
    account1.name
Expecting:
    'John Green'
ok
Trying:
    account1.balance
Expecting:
    Decimal('50.00')
ok
Trying:
    account2 = Account('John Green', Decimal('-50.00'))
Expecting:
    Traceback (most recent call last):
        ...
    ValueError: Initial balance must be >= to 0.00.
ok
3 items had no tests:
    __main__
    __main__.Account
    __main__.Account.deposit
1 items passed all tests:
   4 tests in __main__.Account.__init__
4 tests in 4 items.
4 passed and 0 failed.
Test passed.


In verbose mode, testmod shows for each test what it’s "Trying" to do and what it’s
"Expecting" as a result, followed by "ok" if the test is successful. After completing the
tests in verbose mode, testmod shows a summary of the results.


To demonstrate a failed test, “comment out” lines 25-26 in accountdoctest.py by
preceding each with a #, then run accountdoctest.py as a script. To save space, we show
just the portions of the doctest output indicating the failed test:


**********************************************************************
File "accountdoctest.py", line 18, in __main__.Account.__init__
Failed example:
    account2 = Account('John Green', Decimal('-50.00'))
Expected:
    Traceback (most recent call last):
        ...
    ValueError: Initial balance must be >= to 0.00.
Got nothing
**********************************************************************
1 items had failures:
   1 of   4 in __main__.Account.__init__
4 tests in 4 items.
3 passed and 1 failed.
***Test Failed*** 1 failures.


In this case, we see that line 18’s test failed. The testmod function was expecting a traceback
indicating that a ValueError was raised due to the invalid initial balance. That exception
did not occur, so the test failed. If you were the programmer responsible for defining this
class, this failing test would indicate that something is wrong with the validation code in your
__init__ method.


IPython %doctest_mode Magic


A convenient way to create doctests for existing code is to use an IPython interactive session
to test your code, then copy and paste that session into a docstring. IPython’s In [] and
Out[] prompts are not compatible with doctest, so IPython provides the magic
%doctest_mode to display prompts in the correct doctest format. The magic toggles
between the two prompt styles. The first time you execute %doctest_mode, IPython
switches to >>> prompts for input and no output prompts. The second time you execute
%doctest_mode, IPython switches back to In [] and Out[] prompts.


10.15 NAMESPACES AND SCOPES 


In the “Functions” chapter, we showed that each identifier has a scope that determines where 
you can use it in your program, and we introduced the local and global scopes. Here we 


continue our discussion of scopes with an introduction to namespaces. 


Scopes are determined by namespaces, which associate identifiers with objects and are 
implemented “under the hood” as dictionaries. All namespaces are independent of one 
another. So, the same identifier may appear in multiple namespaces. There are three primary 


namespaces—local, global and built-in. 
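Because namespaces behave like dictionaries, you can inspect them with the built-in functions locals, globals and vars. The following sketch uses a hypothetical function named show_local_namespace and the standard builtins module, which exposes the built-in namespace:

```python
import builtins

def show_local_namespace():
    y = 'local y'
    return dict(locals())  # copy of this call's local namespace

local_names = show_local_namespace()
print(local_names)
print(type(globals()).__name__)          # the global namespace is a dictionary
print(vars(builtins)['print'] is print)  # print lives in the built-in namespace
```

The identifier y appears only in the dictionary returned from the function call, reflecting that the local namespace is independent of the global one.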


Local Namespace 


Each function and method has a local namespace that associates local identifiers (such as, 
parameters and local variables) with objects. The local namespace exists from the moment 
the function or method is called until it terminates and is accessible only to that function or 
method. In a function’s or method’s suite, assigning to a variable that does not exist creates a 
local variable and adds it to the local namespace. Identifiers in the local namespace are in 


scope from the point at which you define them until the function or method terminates. 


Global Namespace 


Each module has a global namespace that associates a module’s global identifiers (such as 
global variables, function names and class names) with objects. Python creates a module’s 
global namespace when it loads the module. A module’s global namespace exists and its 
identifiers are in scope to the code within that module until the program (or interactive 
session) terminates. An IPython session has its own global namespace for all the identifiers 


you create in that session. 


Each module’s global namespace also has an identifier called __name__ containing the
module’s name, such as 'math' for the math module or 'random' for the random module.
As you saw in the previous section’s doctest example, __name__ contains '__main__' for
a .py file that you run as a script.


Built-In Namespace 


The built-in namespace associates identifiers for Python’s built-in functions
(such as input and range) and types (such as int, float and str) with objects that
define those functions and types. Python creates the built-in namespace when the interpreter
starts executing. The built-in namespace’s identifiers remain in scope for all code until the
program (or interactive session) terminates.⁹

9. This assumes you do not shadow the built-in functions or types by redefining their
identifiers in a local or global namespace. We discussed shadowing in the “Functions” chapter.


Finding Identifiers in Namespaces 


When you use an identifier, Python searches for that identifier in the currently accessible
namespaces, proceeding from local to global to built-in. To help you understand the
namespace search order, consider the following IPython session:

In [1]: z = 'global z'

In [2]: def print_variables():
   ...:     y = 'local y in print_variables'
   ...:     print(y)
   ...:     print(z)
   ...:

In [3]: print_variables()
local y in print_variables
global z


The identifiers you define in an IPython session are placed in the session’s global namespace.
When snippet [3] calls print_variables, Python searches the local, global and built-in
namespaces as follows:

• Snippet [3] is not in a function or method, so the session’s global namespace and the
built-in namespace are currently accessible. Python first searches the session’s global
namespace, which contains print_variables. So print_variables is in scope and
Python uses the corresponding object to call print_variables.

• As print_variables begins executing, Python creates the function’s local namespace.
When function print_variables defines the local variable y, Python adds y to the
function’s local namespace. The variable y is now in scope until the function finishes
executing.

• Next, print_variables calls the built-in function print, passing y as the argument.
To execute this call, Python must resolve the identifiers y and print. The identifier y is
defined in the local namespace, so it’s in scope and Python will use the corresponding
object (the string 'local y in print_variables') as print’s argument. To call the
function, Python must find print’s corresponding object. First, it looks in the local
namespace, which does not define print. Next, it looks in the session’s global
namespace, which does not define print. Finally, it looks in the built-in namespace,
which does define print. So, print is in scope and Python uses the corresponding
object to call print.

• Next, print_variables calls the built-in function print again with the argument z,
which is not defined in the local namespace. So, Python looks in the global namespace.
The argument z is defined in the global namespace, so z is in scope and Python will use
the corresponding object (the string 'global z') as print’s argument. Again, Python
finds the identifier print in the built-in namespace and uses the corresponding object to
call print.

• At this point, we reach the end of the print_variables function’s suite, so the function
terminates and its local namespace no longer exists, meaning the local variable y is now
undefined.


To prove that y is undefined, let’s try to display y: 


In [4]: y
NameError                                 Traceback (most recent call last)
<ipython-input-4-9063a9f0e032> in <module>()
----> 1 y

NameError: name 'y' is not defined





In this case, there’s no local namespace, so Python searches for y in the session’s global 
namespace. The identifier y is not defined there, so Python searches for y in the built-in 


namespace. Again, Python does not find y. There are no more namespaces to search, so 





Python raises a NameError, indicating that y is not defined. 


The identifiers print_variables and z still exist in the session’s global namespace, so we
can continue using them. For example, let’s evaluate z to see its value:

In [5]: z
Out[5]: 'global z'


Nested Functions 


One namespace we did not cover in the preceding discussion is the enclosing namespace. 
Python allows you to define nested functions inside other functions or methods. For 
example, if a function or method performs the same task several times, you might define a 
nested function to avoid repeating code in the enclosing function. When you access an 
identifier inside a nested function, Python searches the nested function’s local namespace 
first, then the enclosing function’s namespace, then the global namespace and finally the 
built-in namespace. This is sometimes referred to as the LEGB (local, enclosing, global, 


built-in) rule. 
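The LEGB search order can be sketched with a pair of hypothetical functions named outer and inner. The identifier x exists in both the global namespace and the enclosing function’s namespace, and the nested function finds the enclosing one first:

```python
x = 'global x'

def outer():
    x = 'enclosing x'

    def inner():
        # x is not local to inner, so the enclosing function's x is found next;
        # len is found last, in the built-in namespace
        return x, len

    return inner()

value, function = outer()
print(value)
print(function is len)
```

If outer did not assign x, the same lookup inside inner would continue to the global namespace and find 'global x'.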


Class Namespace 


A class has a namespace in which its class attributes are stored. When you access a class
attribute, Python looks for that attribute first in the class’s namespace, then in the base class’s
namespace, and so on, until either it finds the attribute or it reaches class object. If the
attribute is not found, an AttributeError occurs.


Object Namespace 


Each object has its own namespace containing the object’s data attributes; its methods are
found through the class’s namespace. The class’s __init__ method starts with an empty
object (self) and adds each attribute to the object’s namespace. Once you define an
attribute in an object’s namespace, clients using the object may access the attribute’s value.
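The built-in vars function returns the dictionary behind a class’s or an object’s namespace, which makes the distinction visible. This sketch assumes a simplified Account class with a hypothetical class attribute bank_name:

```python
class Account:
    bank_name = 'Hypothetical Bank'  # class attribute (illustrative name)

    def __init__(self, name, balance):
        self.name = name        # added to this object's namespace
        self.balance = balance  # added to this object's namespace

account = Account('John Green', 50)
print(vars(account))                 # the object's namespace
print('bank_name' in vars(Account))  # stored in the class's namespace
print(account.bank_name)             # not in the object, so found via the class
```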


10.16 INTRO TO DATA SCIENCE: TIME SERIES AND SIMPLE 
LINEAR REGRESSION 


We've looked at sequences, such as lists, tuples and arrays. In this section, we'll discuss time 
series, which are sequences of values (called observations) associated with points in time. 
Some examples are daily closing stock prices, hourly temperature readings, the changing 
positions of a plane in flight, annual crop yields and quarterly company profits. Perhaps the 
ultimate time series is the stream of time-stamped tweets coming from Twitter users 


worldwide. In the “Data Mining Twitter” chapter, we'll study Twitter data in depth. 


In this section, we'll use a technique called simple linear regression to make predictions from 
time series data. We'll use the 1895 through 2018 January average high temperatures in New 
York City to predict future average January high temperatures and to estimate the average 


January high temperatures for years preceding 1895. 


In the “Machine Learning” chapter, we'll revisit this example using the scikit-learn library. In 
the “Deep Learning” chapter, we'll use recurrent neural networks (RNNs) to analyze time 


series. 


In later chapters, we'll see that time series are popular in financial applications and with the
Internet of Things (IoT), which we'll discuss in the “Big Data: Hadoop, Spark, NoSQL and
IoT” chapter.


In this section, we'll display graphs with Seaborn and pandas, which both use Matplotlib, so
launch IPython with Matplotlib support:


ipython --matplotlib 


Time Series 


The data we'll use is a time series in which the observations are ordered by year. Univariate 
time series have one observation per time, such as the average of the January high 
temperatures in New York City for a particular year. Multivariate time series have two or 
more observations per time, such as temperature, humidity and barometric pressure 


readings in a weather application. Here, we'll analyze a univariate time series. 


Two tasks often performed with time series are: 


• Time series analysis, which looks at existing time series data for patterns, helping data
analysts understand the data. A common analysis task is to look for seasonality in the
data. For example, in New York City, the monthly average high temperature varies
significantly based on the seasons (winter, spring, summer or fall).

• Time series forecasting, which uses past data to predict the future.


We'll perform time series forecasting in this section. 


Simple Linear Regression 


Using a technique called simple linear regression, we'll make predictions by finding a 
linear relationship between the months (January of each year) and New York City’s average 
January high temperatures. Given a collection of values representing an independent 
variable (the month/year combination) and a dependent variable (the average high 
temperature for that month/year), simple linear regression describes the relationship 


between these variables with a straight line, known as the regression line. 


Linear Relationships 


To understand the general concept of a linear relationship, consider Fahrenheit and Celsius 
temperatures. Given a Fahrenheit temperature, we can calculate the corresponding Celsius 


temperature using the following formula: 


c = 5 / 9 * (f - 32)


In this formula, f (the Fahrenheit temperature) is the independent variable, and c (the 
Celsius temperature) is the dependent variable—each value of c depends on the value of f 


used in the calculation. 


Plotting Fahrenheit temperatures and their corresponding Celsius temperatures produces a
straight line. To show this, let’s first create a lambda for the preceding formula and use it to
calculate the Celsius equivalents of the Fahrenheit temperatures 0-100 in 10-degree
increments. We store each Fahrenheit/Celsius pair as a tuple in temps:

In [1]: c = lambda f: 5 / 9 * (f - 32)

In [2]: temps = [(f, c(f)) for f in range(0, 101, 10)]





Next, let’s place the data in a DataFrame, then use its plot method to display the linear 

relationship between the Fahrenheit and Celsius temperatures. The plot method’s style 
keyword argument controls the data’s appearance. The period in the string ' .-' indicates 

that each point should appear as a dot, and the dash indicates that lines should connect the 
dots. We manually set the y-axis label to 'Celsius' because the plot method shows 


'Celsius' only in the graph’s upper-left corner legend, by default. 


In [3]: import pandas as pd

In [4]: temps_df = pd.DataFrame(temps, columns=['Fahrenheit', 'Celsius'])

In [5]: axes = temps_df.plot(x='Fahrenheit', y='Celsius', style='.-')

In [6]: y_label = axes.set_ylabel('Celsius')
Components of the Simple Linear Regression Equation


The points along any straight line (in two dimensions) like those shown in the preceding
graph can be calculated with the equation:

y = mx + b

where

• m is the line’s slope,

• b is the line’s intercept with the y-axis (at x = 0),

• x is the independent variable (the date in this example), and

• y is the dependent variable (the temperature in this example).

In simple linear regression, y is the predicted value for a given x.


Function linregress from SciPy’s stats Module


Simple linear regression determines the slope (m) and intercept (b) of a straight line that best 
fits your data. Consider the following diagram, which shows a few of the time-series data 
points we'll process in this section and a corresponding regression line. We added vertical 


lines to indicate each data point’s distance from the regression line: 
The simple linear regression algorithm iteratively adjusts the slope and intercept and, for
each adjustment, calculates the square of each point’s distance from the line. The “best fit”
occurs when the slope and intercept values minimize the sum of those squared distances.
This is known as an ordinary least squares calculation.¹⁰

10. https://en.wikipedia.org/wiki/Ordinary_least_squares.

The SciPy (Scientific Python) library is widely used for engineering, science and math in
Python. This library’s linregress function (from the scipy.stats module) performs
simple linear regression for you. After calling linregress, you'll plug the resulting slope
and intercept into the y = mx + b equation to make predictions.
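To show what a least squares fit computes, here is a sketch in plain Python. It is a stand-in for linregress, not its implementation: it uses the standard closed-form formulas for the slope and intercept that minimize the sum of squared distances, and the function name least_squares_line is ours. We fit the Fahrenheit/Celsius data from earlier, then plug the results into y = mx + b:

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x  # the fitted line passes through the means
    return slope, intercept

fahrenheit = list(range(0, 101, 10))
celsius = [5 / 9 * (f - 32) for f in fahrenheit]

m, b = least_squares_line(fahrenheit, celsius)

def predict(x):
    """Apply y = mx + b using the fitted slope and intercept."""
    return m * x + b

print(m, b, predict(212))
```

Because this data is perfectly linear, the fitted slope is approximately 5/9 and the prediction for 212 degrees Fahrenheit is approximately 100 degrees Celsius. With SciPy installed, scipy.stats.linregress(fahrenheit, celsius) returns a result whose slope and intercept attributes can be used the same way.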


Pandas 


In the three previous Intro to Data Science sections, you used pandas to work with data. 
You'll continue using pandas throughout the rest of the book. In this example, we'll load the 
data for New York City’s 1895-2018 average January high temperatures from a CSV file into 


a DataFrame. We'll then format the data for use in this example. 


Seaborn Visualization 


We'll use Seaborn to plot the DataFrame’s data with a regression line that shows the average 


high-temperature trend over the period 1895-2018. 


Getting Weather Data from NOAA 


Let’s get the data for our study. The National Oceanic and Atmospheric Administration
(NOAA)¹¹ offers lots of public historical data including time series for average high
temperatures in specific cities over various time intervals.

11. http://www.noaa.gov.


We obtained the January average high temperatures for New York City from 1895 through 
2018 from NOAA’s “Climate at a Glance” time series at: 


https://www.ncdc.noaa.gov/cag/


On that web page, you can select temperature, precipitation and other data for the entire
U.S., regions within the U.S., states, cities and more. Once you've set the area and time frame,
click Plot to display a diagram and view a table of the selected data. At the top of that table
are links for downloading the data in several formats including CSV, which we discussed in
the “Files and Exceptions” chapter. NOAA’s maximum date range available at the time of this
writing was 1895-2018. For your convenience, we provided the data in the ch10 examples
folder in the file ave_hi_nyc_jan_1895-2018.csv. If you download the data on your
own, delete the rows above the line containing "Date,Value,Anomaly".


This data contains three columns per observation:

• Date—A value of the form 'YYYYMM' (such as '201801'). MM is always 01 because we
downloaded data for only January of each year.
• Value—A floating-point Fahrenheit temperature.
• Anomaly—The difference between the value for the given date and average values for all
dates. We do not use the Anomaly value in this example, so we'll ignore it.
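To make the three-column layout concrete, here is a small sketch that parses rows of that shape with Python's standard csv module. The sample rows and the parse_observations helper are illustrative assumptions; the real file is much longer.

```python
import csv
import io

# A few sample rows in the Date,Value,Anomaly layout (illustrative only)
sample = """Date,Value,Anomaly
189501,34.2,-3.2
189601,34.7,-2.7
189701,35.5,-1.9
"""


def parse_observations(text):
    """Split each 'YYYYMM' date into year and month; ignore the Anomaly column."""
    observations = []
    for record in csv.DictReader(io.StringIO(text)):
        date = record['Date']
        observations.append({
            'year': int(date[:4]),    # first four characters are the year
            'month': int(date[4:]),   # last two characters are the month
            'temperature': float(record['Value']),
        })
    return observations


observations = parse_observations(sample)
for observation in observations:
    print(observation)
```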


Loading the Average High Temperatures into a DataFrame 


Let’s load and display the New York City data from ave_hi_nyc_jan_1895-2018.csv:

In [7]: nyc = pd.read_csv('ave_hi_nyc_jan_1895-2018.csv')


We can look at the DataFrame’s head and tail to get a sense of the data: 


In [8]: nyc.head()


Cleaning the Data


We'll soon use Seaborn to graph the Date-Value pairs and a regression line. When plotting
data from a DataFrame, Seaborn labels a graph’s axes using the DataFrame’s column
names. For readability, let’s rename the 'Value' column as 'Temperature':

In [10]: nyc.columns = ['Date', 'Temperature', 'Anomaly']


In [11]: nyc.head(3) 








Out[11]: 
     Date  Temperature  Anomaly
0  189501         34.2     -3.2
1  189601         34.7     -2.7
2  189701         35.5     -1.9


Seaborn labels the tick marks on the x-axis with Date values. Since this example processes
only January temperatures, the x-axis labels will be more readable if they do not contain 01
(for January), so we'll remove it from each Date. First, let’s check the column’s type:


In [12]: nyc.Date.dtype 
Out[12]: dtype('int64') 


The values are integers, so we can divide by 100 to truncate the last two digits. Recall that 
each column in a DataFrame is a Series. Calling Series method floordiv performs 


integer division on every element of the Series: 




In [13]: nyc.Date = nyc.Date.floordiv(100) 


In [14]: nyc.head(3) 


Out[14]: 
   Date  Temperature  Anomaly
0  1895         34.2     -3.2
1  1896         34.7     -2.7
2  1897         35.5     -1.9
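The effect of floordiv is easy to check on a small Series. The following is a minimal sketch with hypothetical date values; it also shows that the // operator gives the same result.

```python
import pandas as pd

# Hypothetical YYYYMM values like those in the Date column
dates = pd.Series([189501, 189601, 201801])

# Integer division by 100 drops the two-digit month
years = dates.floordiv(100)
print(years.tolist())

# The // operator is equivalent to calling floordiv
same_result = (dates // 100).equals(years)
print(same_result)
```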


Calculating Basic Descriptive Statistics for the Dataset 


For some quick statistics on the dataset’s temperatures, call describe on the Temperature 
column. We can see that there are 124 observations, the mean value of the observations is 
37.60, and the lowest and highest observations are 26.10 and 47.60 degrees, respectively: 


In [15]: pd.set_option('display.precision', 2)

In [16]: nyc.Temperature.describe()
Out[16]: 
count    124.00
mean      37.60
std        4.54
min       26.10
25%       34.58
50%       37.60
75%       40.60
max       47.60
Name: Temperature, dtype: float64





Forecasting Future January Average High Temperatures 


The SciPy (Scientific Python) library is widely used for engineering, science and math in 
Python. Its stats module provides function linregress, which calculates a regression 


line’s slope and intercept for a given set of data points: 


In [17]: from scipy import stats

In [18]: linear_regression = stats.linregress(x=nyc.Date,
    ...:                                      y=nyc.Temperature)
    ...:


Function linregress receives two one-dimensional arrays * of the same length
representing the data points’ x- and y-coordinates. The keyword arguments x and y represent
the independent and dependent variables, respectively. The object returned by linregress
contains the regression line’s slope and intercept:

* These arguments also can be one-dimensional array-like objects, such as lists or pandas
Series.

In [19]: linear_regression.slope
Out[19]: 0.00014771361132966167

In [20]: linear_regression.intercept
Out[20]: 8.694845520062952


We can use these values with the simple linear regression equation for a straight line, y = mx
+ b, to predict the average January temperature in New York City for a given year. Let’s
predict the average Fahrenheit temperature for January of 2019. In the following calculation,
linear_regression.slope is m, 2019 is x (the date value for which you'd like to predict
the temperature), and linear_regression.intercept is b:

In [21]: linear_regression.slope * 2019 + linear_regression.intercept
Out[21]: 38.51837136113297

We also can approximate what the average temperature might have been in the years before
1895. For example, let’s approximate the average temperature for January of 1890:

In [22]: linear_regression.slope * 1890 + linear_regression.intercept
Out[22]: 36.612865774980335


For this example, we had data for 1895-2018. You should expect that the further you go 
outside this range, the less reliable the predictions will be. 
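The fit-then-predict pattern can be wrapped in a small helper. The sketch below assumes SciPy is installed and uses hypothetical, exactly linear data (0.5 degrees per year) so the result is easy to verify by hand; the predict function is an illustrative name, not part of SciPy.

```python
from scipy import stats

# Hypothetical data that rises exactly 0.5 degrees per year
years = [2000, 2001, 2002, 2003, 2004]
temps = [30.0, 30.5, 31.0, 31.5, 32.0]

fit = stats.linregress(x=years, y=temps)


def predict(year):
    """Apply y = mx + b using the fitted slope (m) and intercept (b)."""
    return fit.slope * year + fit.intercept


print(f'slope={fit.slope:.3f}')
print(f'2005 prediction={predict(2005):.2f}')  # one year past the data
print(f'1990 prediction={predict(1990):.2f}')  # ten years before the data
```

Real measurements are not perfectly linear, so predictions far outside the observed range deserve correspondingly less trust.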


Plotting the Average High Temperatures and a Regression Line 


Next, let’s use Seaborn’s regplot function to plot each data point with the dates on the x- 
axis and the temperatures on the y-axis. The regplot function creates the scatter plot or 
scattergram below in which the scattered dots represent the Temperatures for the given 


Dates, and the straight line displayed through the points is the regression line: 


[Figure: scatter plot of Temperature (y-axis) versus Date (x-axis, 1900 to 2020) with the regression line]


First, close the prior Matplotlib window if you have not done so already—otherwise,
regplot will use the existing window that already contains a graph. Function regplot’s x
and y keyword arguments are one-dimensional arrays 3 of the same length representing the
x-y coordinate pairs to plot. Recall that pandas automatically creates attributes for each
column name if the name can be a valid Python identifier: 4

3 These arguments also can be one-dimensional array-like objects, such as lists or pandas
Series.

4 For readers with more of a statistics background, the shaded area surrounding the regression
line is the 95% confidence interval for the regression line
(https://en.wikipedia.org/wiki/Simple_linear_regression#Confidence_interva).
To draw the diagram without a confidence interval, add the keyword argument ci=None to
the regplot function’s argument list.

In [23]: import seaborn as sns

In [24]: sns.set_style('whitegrid')





In [25]: axes = sns.regplot(x=nyc.Date, y=nyc.Temperature) 


The regression line’s slope (lower at the left and higher at the right) indicates a warming
trend over the last 124 years. In this graph, the y-axis represents a 21.5-degree temperature
range between the minimum of 26.1 and the maximum of 47.6, so the data appears to be
spread significantly above and below the regression line, making it difficult to see the linear
relationship. This is a common issue in data analytics visualizations. When you have axes that
reflect different kinds of data (dates and temperatures in this case), how do you reasonably
determine their respective scales? In the preceding graph, this is purely an issue of the
graph’s height—Seaborn and Matplotlib auto-scale the axes, based on the data’s range of
values. We can scale the y-axis range of values to emphasize the linear relationship. Here, we
scaled the y-axis from a 21.5-degree range to a 60-degree range (from 10 to 70 degrees):

In [26]: axes.set_ylim(10, 70)
Out[26]: (10, 70)


[Figure: the same scatter plot and regression line with the Temperature axis scaled from 10 to 70 and the Date axis from 1900 to 2020]
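The difference between auto-scaled and manually scaled axes can be inspected without displaying a window. This is a minimal sketch that assumes Matplotlib is installed and uses its non-interactive Agg backend; the three data points are hypothetical.

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; no window is opened
import matplotlib.pyplot as plt

# Hypothetical points, for illustration only
years = [1900, 1950, 2000]
temps = [34.0, 37.5, 40.0]

figure, axes = plt.subplots()
axes.scatter(years, temps)

auto_limits = axes.get_ylim()  # limits Matplotlib chose from the data's range
axes.set_ylim(10, 70)          # widen the y-axis to a fixed 60-degree range
manual_limits = axes.get_ylim()

print('auto:', auto_limits)
print('manual:', manual_limits)
```

With the wider range, the same points occupy a narrower vertical band, so the overall trend is easier to see.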


Getting Time Series Datasets 


Here are some popular sites where you can find time series to use in your studies: 


Sources of time-series datasets

https://data.gov/
This is the U.S. government’s open data portal. Searching for “time series” yields
over 7200 time-series datasets.

https://www.ncdc.noaa.gov/cag/
The National Oceanic and Atmospheric Administration (NOAA) Climate at a Glance
portal provides both global and U.S. weather-related time series.

https://www.esrl.noaa.gov/psd/data/timeseries/
NOAA’s Earth System Research Laboratory (ESRL) portal provides monthly and
seasonal climate-related time series.

https://www.quandl.com/search
Quandl provides hundreds of free financial-related time series, as well as fee-based
time series.

https://datamarket.com/data/list/?q=provider:tsdl
The Time Series Data Library (TSDL) provides links to hundreds of time series
datasets across many industries.

http://archive.ics.uci.edu/ml/datasets.html
The University of California Irvine (UCI) Machine Learning Repository contains
dozens of time-series datasets for a variety of topics.

http://inforumweb.umd.edu/econdata/econdata.html
The University of Maryland’s EconData service provides links to thousands of
economic time series from various U.S. government agencies.


10.17 WRAP-UP 


In this chapter, we discussed the details of crafting valuable classes. You saw how to define a
class, create objects of the class, access an object’s attributes and call its methods. You
defined the special method __init__ to create and initialize a new object’s data attributes.

We discussed controlling access to attributes and using properties. We showed that all object
attributes may be accessed directly by a client. We discussed identifiers with single leading
underscores (_), which indicate attributes that are not meant to be accessed by client code.
We showed how to implement “private” attributes via the double-leading-underscore (__)
naming convention, which tells Python to mangle an attribute’s name.

We implemented a card shuffling and dealing simulation consisting of a Card class and a
DeckOfCards class that maintained a list of Cards, and displayed the deck both as strings
and as card images using Matplotlib. We introduced special methods __repr__, __str__
and __format__ for creating string representations of objects.

Next, we looked at Python’s capabilities for creating base classes and subclasses. We showed
how to create a subclass that inherits many of its capabilities from its superclass, then adds
more capabilities, possibly by overriding the base class’s methods. We created a list
containing both base class and subclass objects to demonstrate Python’s polymorphic
programming capabilities.

We introduced operator overloading for defining how Python’s built-in operators work with
objects of custom class types. You saw that overloaded operator methods are implemented by
overriding various special methods that all classes inherit from class object. We discussed
the Python exception class hierarchy and creating custom exception classes.

We showed how to create a named tuple that enables you to access tuple elements via
attribute names rather than index numbers. Next, we introduced Python 3.7’s new data
classes, which can autogenerate various boilerplate code commonly provided in class
definitions, such as the __init__, __repr__ and __eq__ special methods.

You saw how to write unit tests for your code in docstrings, then execute those tests
conveniently via the doctest module’s testmod function. Finally, we discussed the various
namespaces that Python uses to determine the scopes of identifiers.

In the next part of the book, we present a series of implementation case studies that use a mix
of AI and big-data technologies. We explore natural language processing, data mining
Twitter, IBM Watson and cognitive computing, supervised and unsupervised machine
learning, and deep learning with convolutional neural networks and recurrent neural
networks. We discuss big-data software and hardware infrastructure, including NoSQL
databases, Hadoop and Spark with a major emphasis on performance. You’re about to see
some really cool stuff!


11. Natural Language Processing (NLP) 


Objectives 
In this chapter you'll: 


• Perform natural language processing (NLP) tasks, which are fundamental to many of the
forthcoming data science case study chapters.
• Run lots of NLP demos.
• Use the TextBlob, NLTK, Textatistic and spaCy NLP libraries and their pretrained models
to perform various NLP tasks.
• Tokenize text into words and sentences.
• Use parts-of-speech tagging.
• Use sentiment analysis to determine whether text is positive, negative or neutral.
• Detect the language of text and translate between languages using TextBlob’s Google
Translate support.
• Get word roots via stemming and lemmatization.
• Use TextBlob’s spell checking and correction capabilities.
• Get word definitions, synonyms and antonyms.
• Remove stop words from text.
• Create word clouds.
• Determine text readability with Textatistic.
• Use the spaCy library for named entity recognition and similarity detection.


Outline 





11.1 Introduction
11.2 TextBlob
11.2.1 Create a TextBlob
11.2.2 Tokenizing Text into Sentences and Words
11.2.3 Parts-of-Speech Tagging
11.2.4 Extracting Noun Phrases
11.2.5 Sentiment Analysis with TextBlob’s Default Sentiment Analyzer
11.2.6 Sentiment Analysis with the NaiveBayesAnalyzer
11.2.7 Language Detection and Translation
11.2.8 Inflection: Pluralization and Singularization
11.2.9 Spell Checking and Correction
11.2.10 Normalization: Stemming and Lemmatization
11.2.11 Word Frequencies
11.2.12 Getting Definitions, Synonyms and Antonyms from WordNet
11.2.13 Deleting Stop Words
11.2.14 n-grams
11.3 Visualizing Word Frequencies with Bar Charts and Word Clouds
11.3.1 Visualizing Word Frequencies with Pandas
11.3.2 Visualizing Word Frequencies with Word Clouds
11.4 Readability Assessment with Textatistic
11.5 Named Entity Recognition with spaCy
11.6 Similarity Detection with spaCy
11.7 Other NLP Libraries and Tools
11.8 Machine Learning and Deep Learning Natural Language Applications
11.9 Natural Language Datasets
11.10 Wrap-Up





11.1 INTRODUCTION 


Your alarm wakes you, and you hit the “Alarm Off” button. You reach for your smartphone
and read your text messages and check the latest news clips. You listen to TV hosts
interviewing celebrities. You speak to family, friends and colleagues and listen to their
responses. You have a hearing-impaired friend with whom you communicate via sign
language and who enjoys closed-captioned video programs. You have a blind colleague who
reads braille, listens to books being read by a computerized book reader and listens to a
screen reader speak about what’s on his computer screen. You read emails, distinguishing
junk from important communications and send email. You read novels or works of non-fiction.
You drive, observing road signs like “Stop,” “Speed Limit 35” and “Road Under
Construction.” You give your car verbal commands, like “call home,” “play classical music” or
ask questions like, “Where’s the nearest gas station?” You teach a child how to speak and
read. You send a sympathy card to a friend. You read books. You read newspapers and
magazines. You take notes during a class or meeting. You learn a foreign language to prepare
for travel abroad. You receive a client email in Spanish and run it through a free translation
program. You respond in English knowing that your client can easily translate your email
back to Spanish. You are uncertain about the language of an email, but language detection
software instantly figures that out for you and translates the email to English.


These are examples of natural language communications in text, voice, video, sign 
language, braille and other forms with languages like English, Spanish, French, Russian, 
Chinese, Japanese and hundreds more. In this chapter, you'll master many natural language 
processing (NLP) capabilities through a series of hands-on demos and IPython sessions. 
You'll use many of these NLP capabilities in the upcoming data science case study chapters. 


Natural language processing is performed on text collections, composed of Tweets, Facebook 
posts, conversations, movie reviews, Shakespeare’s plays, historic documents, news items, 
meeting logs, and so much more. A text collection is known as a corpus, the plural of which 


is corpora. 


Natural language lacks mathematical precision. Nuances of meaning make natural language 
understanding difficult. A text’s meaning can be influenced by its context and the reader’s 
“world view.” Search engines, for example, can get to “know you” through your prior 
searches. The upside is better search results. The downside could be invasion of privacy. 


11.2 TEXTBLOB 1


1 https://textblob.readthedocs.io/en/latest/.

TextBlob is an object-oriented NLP text-processing library that is built on the NLTK and 
pattern NLP libraries and simplifies many of their capabilities. Some of the NLP tasks 
TextBlob can perform include: 


• Tokenization—splitting text into pieces called tokens, which are meaningful units,
such as words and numbers.
• Parts-of-speech (POS) tagging—identifying each word’s part of speech, such as noun,
verb, adjective, etc.
• Noun phrase extraction—locating groups of words that represent nouns, such as “red
brick factory.” 2
• Sentiment analysis—determining whether text has positive, neutral or negative
sentiment.
• Inter-language translation and language detection powered by Google Translate.
• Inflection 3—pluralizing and singularizing words. There are other aspects of inflection
that are not part of TextBlob.
• Spell checking and spelling correction.
• Stemming—reducing words to their stems by removing prefixes or suffixes. For
example, the stem of “varieties” is “varieti.”
• Lemmatization—like stemming, but produces real words based on the original words’
context. For example, the lemmatized form of “varieties” is “variety.”
• Word frequencies—determining how often each word appears in a corpus.
• WordNet integration for finding word definitions, synonyms and antonyms.
• Stop word elimination—removing common words, such as a, an, the, I, we, you and
more to analyze the important words in a corpus.
• n-grams—producing sets of consecutive words in a corpus for use in identifying words
that frequently appear adjacent to one another.

2 The phrase red brick factory illustrates why natural language is such a difficult subject.
Is a red brick factory a factory that makes red bricks? Is it a red factory that makes bricks
of any color? Is it a factory built of red bricks that makes products of any type? In today’s
music world, it could even be the name of a rock band or the name of a game on your
smartphone.

3 https://en.wikipedia.org/wiki/Inflection.


Many of these capabilities are used as part of more complex NLP tasks. In this section, we'll 
perform these NLP tasks using TextBlob and NLTK. 
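Several of the tasks listed above (tokenization, stop word elimination, word frequencies and n-grams) can be approximated with the Python standard library alone. The sketch below is only meant to make the terms concrete; it is far cruder than TextBlob or NLTK, and the helper names and the tiny stop-word set are assumptions for illustration.

```python
import re
from collections import Counter

text = 'Today is a beautiful day. Tomorrow looks like bad weather.'

# Tiny illustrative stop-word set; real libraries ship much longer lists
STOP_WORDS = {'a', 'an', 'the', 'i', 'we', 'you'}


def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9']+", text)


def remove_stop_words(tokens):
    """Keep only the tokens that are not in STOP_WORDS."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize(text)
frequencies = Counter(token.lower() for token in tokens)

print(tokens)
print(remove_stop_words(tokens))
print(frequencies.most_common(3))
print(ngrams(tokens, 3)[:2])
```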


Installing the TextBlob Module 


To install TextBlob, open your Anaconda Prompt (Windows), Terminal (macOS/Linux) or 
shell (Linux), then execute the following command: 


conda install -c conda-forge textblob


Windows users might need to run the Anaconda Prompt as an Administrator for proper 
software installation privileges. To do so, right-click Anaconda Prompt in the start menu and 
select More > Run as administrator. 


Once installation completes, execute the following command to download the NLTK corpora 
used by TextBlob: 


ipython -m textblob.download_corpora


These include: 


• The Brown Corpus (created at Brown University 4) for parts-of-speech tagging.
• Punkt for English sentence tokenization.
• WordNet for word definitions, synonyms and antonyms.
• Averaged Perceptron Tagger for parts-of-speech tagging.
• conll2000 for breaking text into components, like nouns, verbs, noun phrases and more—
known as chunking the text. The name conll2000 is from the conference that created
the chunking data—Conference on Computational Natural Language Learning.
• Movie Reviews for sentiment analysis.

4 https://en.wikipedia.org/wiki/Brown_Corpus.


Project Gutenberg 


A great source of text for analysis is the free e-books at Project Gutenberg:
https://www.gutenberg.org

The site contains over 57,000 e-books in various formats, including plain text files. These are
out of copyright in the United States. For information about Project Gutenberg’s Terms of Use
and copyright in other countries, see:
https://www.gutenberg.org/wiki/Gutenberg:Terms_of_Use

In some of this section’s examples, we use the plain-text e-book file for Shakespeare’s Romeo
and Juliet, which you can find at:
https://www.gutenberg.org/ebooks/1513


Project Gutenberg does not allow programmatic access to its e-books. You’re required to copy
the books for that purpose. 5 To download Romeo and Juliet as a plain-text e-book, right
click the Plain Text UTF-8 link on the book’s web page, then select the Save Link As...
(Chrome/FireFox), Download Linked File As... (Safari) or Save target as (Microsoft
Edge) option to save the book to your system. Save it as RomeoAndJuliet.txt in the ch11
examples folder to ensure that our code examples will work correctly. For analysis purposes,
we removed the Project Gutenberg text before "THE TRAGEDY OF ROMEO AND JULIET",
as well as the Project Gutenberg information at the end of the file starting with:

End of the Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

5 https://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_ou





11.2.1 Create a TextBlob 


TextBlob 6 is the fundamental class for NLP with the textblob module. Let’s create a
TextBlob containing two sentences:

6 http://textblob.readthedocs.io/en/latest/api_reference.html#textblob.blob.Tex

In [1]: from textblob import TextBlob

In [2]: text = 'Today is a beautiful day. Tomorrow looks like bad weather.'

In [3]: blob = TextBlob(text)

In [4]: blob
Out[4]: TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")














TextBlobs—and, as you'll see shortly, Sentences and Words—support string methods and
can be compared with strings. They also provide methods for various NLP tasks. Sentences,
Words and TextBlobs inherit from BaseBlob, so they have many common methods and
properties.


11.2.2 Tokenizing Text into Sentences and Words 


Natural language processing often requires tokenizing text before performing other NLP
tasks. TextBlob provides convenient properties for accessing the sentences and words in
TextBlobs. Let’s use the sentences property to get a list of Sentence objects:

In [5]: blob.sentences
Out[5]: 
[Sentence("Today is a beautiful day."),
 Sentence("Tomorrow looks like bad weather.")]

The words property returns a WordList object containing a list of Word objects,
representing each word in the TextBlob with the punctuation removed:

In [6]: blob.words
Out[6]: WordList(['Today', 'is', 'a', 'beautiful', 'day', 'Tomorrow', 'looks',











11.2.3 Parts-of-Speech Tagging


Parts-of-speech (POS) tagging is the process of evaluating words based on their context 
to determine each word’s part of speech. There are eight primary English parts of speech— 
nouns, pronouns, verbs, adjectives, adverbs, prepositions, conjunctions and interjections 
(words that express emotion and that are typically followed by punctuation, like “Yes!” or 


“Ha!”). Within each category there are many subcategories. 


Some words have multiple meanings. For example, the words “set” and “run” have hundreds 
of meanings each! If you look at the dictionary.com definitions of the word “run,” you'll 
see that it can be a verb, a noun, an adjective or a part of a verb phrase. An important use of 
POS tagging is determining a word’s meaning among its possibly many meanings. This is 
important for helping computers “understand” natural language. 


The tags property returns a list of tuples, each containing a word and a string representing 


its part-of-speech tag: 


In [7]: blob
Out[7]: TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

In [8]: blob.tags
Out[8]: 
[('Today', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('beautiful', 'JJ'),
 ('day', 'NN'),
 ('Tomorrow', 'NNP'),
 ('looks', 'VBZ'),
 ('like', 'IN'),
 ('bad', 'JJ'),
 ('weather', 'NN')]





By default, TextBlob uses a PatternTagger to determine parts-of-speech. This class uses
the parts-of-speech tagging capabilities of the pattern library:
https://www.clips.uantwerpen.be/pattern

You can view the library’s 63 parts-of-speech tags at
https://www.clips.uantwerpen.be/pages/MBSP-tags

In the preceding snippet’s output:

• Today, day and weather are tagged as NN—a singular noun or mass noun.
• is and looks are tagged as VBZ—a third person singular present verb.
• a is tagged as DT—a determiner. 7
• beautiful and bad are tagged as JJ—an adjective.
• Tomorrow is tagged as NNP—a proper singular noun.
• like is tagged as IN—a subordinating conjunction or preposition.

7 https://en.wikipedia.org/wiki/Determiner.
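Because the tags property returns ordinary (word, tag) tuples, the result is easy to post-process. The following sketch hard-codes the tagged words discussed above so that it runs without TextBlob installed; the TAG_DESCRIPTIONS dictionary and group_by_tag helper are illustrative names, not part of TextBlob.

```python
from collections import defaultdict

# The (word, tag) pairs discussed above, written out by hand
tagged_words = [
    ('Today', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('beautiful', 'JJ'),
    ('day', 'NN'), ('Tomorrow', 'NNP'), ('looks', 'VBZ'), ('like', 'IN'),
    ('bad', 'JJ'), ('weather', 'NN'),
]

# Short descriptions for just the tags that appear in this example
TAG_DESCRIPTIONS = {
    'NN': 'singular noun or mass noun',
    'VBZ': 'third person singular present verb',
    'DT': 'determiner',
    'JJ': 'adjective',
    'NNP': 'proper singular noun',
    'IN': 'subordinating conjunction or preposition',
}


def group_by_tag(pairs):
    """Collect the words that share each part-of-speech tag."""
    groups = defaultdict(list)
    for word, tag in pairs:
        groups[tag].append(word)
    return dict(groups)


groups = group_by_tag(tagged_words)
for tag, words in groups.items():
    print(f'{tag:<4}{TAG_DESCRIPTIONS[tag]}: {", ".join(words)}')
```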


11.2.4 Extracting Noun Phrases 


Let’s say you're preparing to purchase a water ski so you’re researching them online. You 
might search for “best water ski.” In this case, “water ski” is a noun phrase. If the search 
engine does not parse the noun phrase properly, you probably will not get the best search 
results. Go online and try searching for “best water,” “best ski” and “best water ski” and see 


what you get. 





A TextBlob’s noun_phrases property returns a WordList object containing a list of
Word objects—one for each noun phrase in the text:

In [9]: blob
Out[9]: TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

In [10]: blob.noun_phrases
Out[10]: WordList(['beautiful day', 'tomorrow', 'bad weather'])








Note that a Word representing a noun phrase can contain multiple words. A WordList is an 
extension of Python’s built-in list type. WordLists provide additional methods for 


stemming, lemmatizing, singularizing and pluralizing. 


11.2.5 Sentiment Analysis with TextBlob’s Default Sentiment Analyzer 


One of the most common and valuable NLP tasks is sentiment analysis, which determines 
whether text is positive, neutral or negative. For instance, companies might use this to 
determine whether people are speaking positively or negatively online about their products. 
Consider the positive word “good” and the negative word “bad.” Just because a sentence 
contains “good” or “bad” does not mean the sentence’s sentiment necessarily is positive or 


negative. For example, the sentence 

The food is not good. 

clearly has negative sentiment. Similarly, the sentence 

The movie was not bad. 

clearly has positive sentiment, though perhaps not as positive as something like 


The movie was excellent! 








Sentiment analysis is a complex machine-learning problem. However, libraries like TextBlob 


have pretrained machine learning models for performing sentiment analysis. 


Getting the Sentiment of a TextBlob 





A TextBlob’s sentiment property returns a Sentiment object indicating whether the
text is positive or negative and whether it’s objective or subjective:

In [11]: blob
Out[11]: TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

In [12]: blob.sentiment
Out[12]: Sentiment(polarity=0.07500000000000007,
                   subjectivity=0.8333333333333333)

In the preceding output, the polarity indicates sentiment with a value from -1.0
(negative) to 1.0 (positive) with 0.0 being neutral. The subjectivity is a value from 0.0
(objective) to 1.0 (subjective). Based on the values for our TextBlob, the overall sentiment is
close to neutral, and the text is mostly subjective.


Getting the polarity and subjectivity from the Sentiment Object


The values displayed above probably provide more precision than you need in most cases.
This can detract from numeric output’s readability. The IPython magic %precision allows
you to specify the default precision for standalone float objects and float objects in built-in
types like lists, dictionaries and tuples. Let’s use the magic to round the polarity and
subjectivity values to three digits to the right of the decimal point:

In [13]: %precision 3
Out[13]: '%.3f'

In [14]: blob.sentiment.polarity
Out[14]: 0.075

In [15]: blob.sentiment.subjectivity
Out[15]: 0.833


Getting the Sentiment of a Sentence 


You also can get the sentiment at the individual sentence level. Let’s use the sentences
property to get a list of Sentence 8 objects, then iterate through them and display each
Sentence’s sentiment property:

8 http://textblob.readthedocs.io/en/latest/api_reference.html#textblob.blob.Sen

In [16]: for sentence in blob.sentences:
    ...:     print(sentence.sentiment)
    ...:
Sentiment(polarity=0.85, subjectivity=1.0)
Sentiment(polarity=-0.6999999999999998, subjectivity=0.6666666666666666)

This might explain why the entire TextBlob’s sentiment is close to 0.0 (neutral)—one
sentence is positive (0.85) and the other negative (-0.6999999999999998).
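A quick arithmetic check shows how one positive and one negative sentence can cancel out. The sketch below simply averages the two sentence polarities shown above. For this text the mean happens to match the overall polarity, but that is an observation about this example, not a description of how the analyzer computes its result. The label function and its 0.1 threshold are hypothetical.

```python
from statistics import mean

# Sentence polarities copied from the output above
sentence_polarities = [0.85, -0.6999999999999998]

overall = mean(sentence_polarities)
print(f'{overall:.3f}')


def label(polarity, threshold=0.1):
    """Map a polarity to a coarse label; the threshold is an arbitrary choice."""
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'


print([label(p) for p in sentence_polarities], label(overall))
```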


11.2.6 Sentiment Analysis with the NaiveBayesAnalyzer 





By default, a TextBlob and the Sentences and Words you get from it determine sentiment
using a PatternAnalyzer, which uses the same sentiment analysis techniques as in the
Pattern library. The TextBlob library also comes with a NaiveBayesAnalyzer 9 (module
textblob.sentiments), which was trained on a database of movie reviews. Naive
Bayes 10 is a commonly used machine learning text-classification algorithm. The following
uses the analyzer keyword argument to specify a TextBlob’s sentiment analyzer. Recall
from earlier in this ongoing IPython session that text contains 'Today is a beautiful
day. Tomorrow looks like bad weather.':

9 https://textblob.readthedocs.io/en/latest/api_reference.html#module-textblob.en.sentiments.

10 https://en.wikipedia.org/wiki/Naive_Bayes_classifier.

In [17]: from textblob.sentiments import NaiveBayesAnalyzer

In [18]: blob = TextBlob(text, analyzer=NaiveBayesAnalyzer())

In [19]: blob
Out[19]: TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

Let’s use the TextBlob’s sentiment property to display the text’s sentiment using the
NaiveBayesAnalyzer:

In [20]: blob.sentiment
Out[20]: Sentiment(classification='neg', p_pos=0.47662917962091056, p_neg=0.5











n this case, the overall sentiment is classified as negative (classification='neg'). The 





Sentiment object’s p_pos indicates that the TextB1ob is 47.66% positive, and its p_neg 


indicates that the TextBlob is 52.34% negative. Since the overall sentiment is just slightly 








more negative we’d probably view this Text Blob’s sentiment as neutral overall. 
Now, let’s get the sentiment of each Sentence: 


In [21]: for sentence in blob.sentences:
    ...:     print(sentence.sentiment)
    ...:
Sentiment(classification='pos', p_pos=0.8117563921751951, p_neg=0.18824368782...
Sentiment(classification='neg', p_pos=0.174363226578349, p_neg=0.825636773421...



















Notice that rather than polarity and subjectivity, the Sentiment objects we get from the
NaiveBayesAnalyzer contain a classification, 'pos' (positive) or 'neg' (negative), and
p_pos (probability positive) and p_neg (probability negative) values from 0.0 to 1.0.
Once again, we see that the first sentence is positive and the second is negative.
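The earlier remark that a slightly negative result is effectively neutral can be made explicit in code. The following is a minimal sketch and is not part of TextBlob: the function name label_with_neutral_band and the 0.1 margin are assumptions chosen only for illustration. It works on plain p_pos and p_neg numbers, rounded here from the values discussed above.

```python
def label_with_neutral_band(p_pos, p_neg, margin=0.1):
    """Return 'pos', 'neg' or 'neutral' for a pair of probabilities.

    The margin is a hypothetical threshold: when the two probabilities
    differ by less than it, the text is treated as neutral.
    """
    if abs(p_pos - p_neg) < margin:
        return 'neutral'
    return 'pos' if p_pos > p_neg else 'neg'


# Whole-text values, rounded from the 47.66% and 52.34% discussed above
overall = label_with_neutral_band(0.4766, 0.5234)

# Second sentence's values, rounded
second = label_with_neutral_band(0.1744, 0.8256)
```

With these inputs the whole text falls inside the neutral band, while the second sentence stays clearly negative. A real application would tune the margin against labeled examples.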


11.2.7 Language Detection and Translation 


Inter-language translation is a challenging problem in natural language processing and 
artificial intelligence. With advances in machine learning, artificial intelligence and natural 
language processing, services like Google Translate (100+ languages) and Microsoft Bing 
Translator (60+ languages) can translate between languages instantly. 


Inter-language translation also is great for people traveling to foreign countries. They can use
translation apps to translate menus, road signs and more. There are even efforts at live
speech translation so that you’ll be able to converse in real time with people who do not know
your natural language. Some smartphones can now work together with in-ear headphones to
provide near-live translation of many languages. In the “IBM Watson and Cognitive Computing”
chapter, we develop a script that does near real-time inter-language translation among
languages supported by Watson.


https://www.skype.com/en/features/skype-translator/.

https://www.microsoft.com/en-us/translator/business/live/.

https://www.telegraph.co.uk/technology/2017/10/04/googles-new-headphones-can-translate-foreign-languages-real/.

https://store.google.com/us/product/google_pixel_buds?hl=en-US.

http://www.chicagotribune.com/bluesky/originals/ct-bsi-google-pixel-buds-review-20171115-story.html.





The TextBlob library uses Google Translate to detect a text’s language and translate
TextBlobs, Sentences and Words into other languages. These features require an Internet
connection. Let’s use the detect_language method to detect the language of the text we’re
manipulating ('en' is English):

In [22]: blob
Out[22]: TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

In [23]: blob.detect_language()
Out[23]: 'en'








Next, let’s use the translate method to translate the text to Spanish ('es') then detect
the language on the result. The to keyword argument specifies the target language.

In [24]: spanish = blob.translate(to='es')

In [25]: spanish
Out[25]: TextBlob("Hoy es un hermoso dia. Mañana parece mal tiempo.")

In [26]: spanish.detect_language()
Out[26]: 'es'





Next, let’s translate our TextBlob to simplified Chinese (specified as 'zh' or 'zh-CN')
then detect the language on the result:

In [27]: chinese = blob.translate(to='zh')

In [28]: chinese
Out[28]: TextBlob("…")

In [29]: chinese.detect_language()
Out[29]: 'zh-CN'








Method detect_language’s output always shows simplified Chinese as 'zh-CN', even 


though the translate function can receive simplified Chinese as 'zh' or 'zh-CN'. 


In each of the preceding cases, Google Translate automatically detects the source language. 
You can specify a source language explicitly by passing the from_lang keyword argument to 


the translate method, as in 


chinese = blob.translate(from_lang='en', to='zh')

Google Translate uses ISO 639-1 language codes listed at

https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

ISO is the International Organization for Standardization (https://www.iso.org/).

For the supported languages, you’d use these codes as the values of the from_lang and to
keyword arguments. Google Translate’s list of supported languages is at:

https://cloud.google.com/translate/docs/languages


Calling translate without arguments translates from the detected source language to 
English: 


In [30]: spanish.translate()
Out[30]: TextBlob("Today is a beautiful day. Tomorrow seems like bad weather.")

In [31]: chinese.translate()
Out[31]: TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

Note the slight difference in the English results.


11.2.8 Inflection: Pluralization and Singularization 


Inflections are different forms of the same words, such as singular and plural (like “person”
and “people”) and different verb tenses (like “run” and “ran”). When you’re calculating word
frequencies, you might first want to convert all inflected words to the same form for more
accurate word frequencies. Words and WordLists each support converting words to their
singular or plural forms. Let’s pluralize and singularize a couple of Word objects:


In [1]: from textblob import Word

In [2]: index = Word('index')

In [3]: index.pluralize()
Out[3]: 'indices'

In [4]: cacti = Word('cacti')

In [5]: cacti.singularize()
Out[5]: 'cactus'








Pluralizing and singularizing are sophisticated tasks which, as you can see above, are not as
simple as adding or removing an “s” or “es” at the end of a word.

You can do the same with a WordList:

In [6]: from textblob import TextBlob

In [7]: animals = TextBlob('dog cat fish bird').words

In [8]: animals.pluralize()
Out[8]: WordList(['dogs', 'cats', 'fish', 'birds'])


Note that the word “fish” is the same in both its singular and plural forms. 


11.2.9 Spell Checking and Correction 


For natural language processing tasks, it’s important that the text be free of spelling errors. 
Software packages for writing and editing text, like Microsoft Word, Google Docs and others 
automatically check your spelling as you type and typically display a red line under 
misspelled words. Other tools enable you to manually invoke a spelling checker. 


You can check a Word’s spelling with its spellcheck method, which returns a list of tuples 
containing possible correct spellings and a confidence value. Let’s assume we meant to type 
the word “they” but we misspelled it as “theyr.” The spell checking results show two possible 


corrections with the word 'they' having the highest confidence value: 


In [1]: from textblob import Word

In [2]: word = Word('theyr')

In [3]: %precision 2
Out[3]: '%.2f'

In [4]: word.spellcheck()
Out[4]: [('they', 0.57), ('their', 0.43)]


Note that the word with the highest confidence value might not be the correct word for the 


given context. 
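Because the highest-confidence word is not always the right one, it can help to inspect the confidence before accepting a correction. The following is a minimal sketch that works on any list of (word, confidence) tuples in the shape spellcheck returns. The helper names best_candidate and confident_correction, the 0.5 cutoff and the sample values are all assumptions made for illustration and are not part of TextBlob.

```python
from operator import itemgetter


def best_candidate(candidates):
    """Return the (word, confidence) tuple with the highest confidence."""
    return max(candidates, key=itemgetter(1))


def confident_correction(candidates, minimum=0.5):
    """Return the top word only if its confidence reaches minimum, else None.

    The 0.5 cutoff is an arbitrary choice for this sketch.
    """
    word, confidence = best_candidate(candidates)
    return word if confidence >= minimum else None


# Illustrative list in the same shape that spellcheck returns
candidates = [('they', 0.57), ('their', 0.43)]
top = best_candidate(candidates)
```

Returning None for low-confidence cases lets the caller leave the original word alone or ask a person to decide.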





TextBlobs, Sentences and Words all have a correct method that you can call to correct 
spelling. Calling correct on a Word returns the correctly spelled word that has the highest 


confidence value (as returned by spellcheck): 


In [5]: word.correct()  # chooses word with the highest confidence value
Out[5]: 'they'

Calling correct on a TextBlob or Sentence checks the spelling of each word. For each
incorrect word, correct replaces it with the correctly spelled one that has the highest 


confidence value: 


In [6]: from textblob import TextBlob

In [7]: sentence = TextBlob('Ths sentense has missplled wrds.')

In [8]: sentence.correct()
Out[8]: TextBlob("The sentence has misspelled words.")


11.2.10 Normalization: Stemming and Lemmatization 


Stemming removes a prefix or suffix from a word leaving only a stem, which may or may 
not be a real word. Lemmatization is similar, but factors in the word’s part of speech and 


meaning and results in a real word. 


Stemming and lemmatization are normalization operations, in which you prepare words 
for analysis. For example, before calculating statistics on words in a body of text, you might 
convert all words to lowercase so that capitalized and lowercase words are not treated 
differently. Sometimes, you might want to use a word’s root to represent the word’s many 
forms. For example, in a given application, you might want to treat all of the following words 
as “program”: program, programs, programmer, programming and programmed (and 


perhaps U.K. English spellings, like programmes as well). 
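The normalization idea above can be sketched without any NLP library. In this minimal example the ROOTS dictionary is a hand-written stand-in for a real stemmer or lemmatizer, and the names ROOTS and normalize are hypothetical. It lowercases each word, maps listed forms to a common root, then counts the results.

```python
from collections import Counter

# Hand-written mapping standing in for a stemmer or lemmatizer (illustrative only)
ROOTS = {
    'programs': 'program',
    'programmer': 'program',
    'programming': 'program',
    'programmed': 'program',
    'programmes': 'program',
}


def normalize(word):
    """Lowercase a word, then map it to its root if one is listed."""
    lowered = word.lower()
    return ROOTS.get(lowered, lowered)


words = ['Program', 'programs', 'Programming', 'programmed', 'code']
counts = Counter(normalize(word) for word in words)
```

Here the four forms of “program” are counted together, while 'code' keeps its own entry.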


Words and WordLists each support stemming and lemmatization via the methods stem and 


lemmatize. Let’s use both on a Word: 


In [1]: from textblob import Word

In [2]: word = Word('varieties')

In [3]: word.stem()
Out[3]: 'varieti'

In [4]: word.lemmatize()
Out[4]: 'variety'


11.2.11 Word Frequencies 


Various techniques for detecting similarity between documents rely on word frequencies. As
you’ll see here, TextBlob automatically counts word frequencies. First, let’s load the e-book
for Shakespeare’s Romeo and Juliet into a TextBlob. To do so, we’ll use the Path class
from the Python Standard Library’s pathlib module:

In [1]: from pathlib import Path

In [2]: from textblob import TextBlob

In [3]: blob = TextBlob(Path('RomeoAndJuliet.txt').read_text())

Use the file RomeoAndJuliet.txt that you downloaded earlier. We assume here that you
started your IPython session from that folder. When you read a file with Path’s read_text
method, it closes the file immediately after it finishes reading the file.

Each Project Gutenberg e-book includes additional text, such as their licensing information,
that’s not part of the e-book itself. For this example, we used a text editor to remove that text
from our copy of the e-book.





You can access the word frequencies through the TextBlob’s word_counts dictionary.
Let’s get the counts of several words in the play:

In [4]: blob.word_counts['juliet']
Out[4]: 190

In [5]: blob.word_counts['romeo']
Out[5]: 315

In [6]: blob.word_counts['thou']
Out[6]: 278











If you already have tokenized a TextBlob into a WordList, you can count specific words in
the list via the count method:

In [7]: blob.words.count('joy')
Out[7]: 14

In [8]: blob.noun_phrases.count('lady capulet')
Out[8]: 46


11.2.12 Getting Definitions, Synonyms and Antonyms from WordNet 


WordNet (https://wordnet.princeton.edu/) is a word database created by Princeton University.
The TextBlob library uses the NLTK library’s WordNet interface, enabling you to look up word
definitions, and get synonyms and antonyms. For more information, check out the NLTK
WordNet interface documentation at:

https://www.nltk.org/api/nltk.corpus.reader.html#module-nltk.corpus.reader.wordnet


Getting Definitions 


First, let’s create a Word: 




In [1]: from textblob import Word 


In [2]: happy = Word('happy') 


The Word class’s definitions property returns a list of all the word’s definitions in the 


WordNet database: 


In [3]: happy.definitions
Out[3]:
['enjoying or showing or marked by joy or pleasure',
 'marked by good fortune',
 'eagerly disposed to act or to be of service',
 'well expressed and to the point']


The database does not necessarily contain every dictionary definition of a given word. There’s 
also a define method that enables you to pass a part of speech as an argument so you can 


get definitions matching only that part of speech. 
Getting Synonyms 


You can get a Word’s synsets—that is, its sets of synonyms—via the synsets property. 
The result is a list of Synset objects: 


In [4]: happy.synsets
Out[4]:
[Synset('happy.a.01'),
 Synset('felicitous.s.02'),
 Synset('glad.s.02'),
 Synset('happy.s.04')]


Each Synset represents a group of synonyms. In the notation happy.a.01: 


• happy is the original Word’s lemmatized form (in this case, it’s the same).

• a is the part of speech, which can be a for adjective, n for noun, v for verb, r for adverb or
s for adjective satellite. Many adjective synsets in WordNet have satellite synsets that
represent similar adjectives.

• 01 is a 0-based index number. Many words have multiple meanings, and this is the index
number of the corresponding meaning in the WordNet database.
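The three-part notation described in the list above can be taken apart with ordinary string methods. The following is a minimal sketch that does not use NLTK. The names POS_NAMES and parse_synset_name are hypothetical helpers introduced only for illustration.

```python
POS_NAMES = {
    'a': 'adjective',
    'n': 'noun',
    'v': 'verb',
    'r': 'adverb',
    's': 'adjective satellite',
}


def parse_synset_name(name):
    """Split a name like 'happy.a.01' into (lemma, part of speech, index)."""
    # rsplit from the right so a lemma containing a period stays intact
    lemma, pos, index = name.rsplit('.', 2)
    return lemma, POS_NAMES[pos], int(index)


parsed = parse_synset_name('happy.a.01')
```

Splitting from the right is a deliberate choice here, because only the last two fields have a fixed format.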


There’s also a get_synsets method that enables you to pass a part of speech as an 


argument so you can get Synsets matching only that part of speech. 


You can iterate through the synsets list to find the original word’s synonyms. Each Synset
has a lemmas method that returns a list of Lemma objects representing the synonyms. A
Lemma’s name method returns the synonymous word as a string. In the following code, for
each Synset in the synsets list, the nested for loop iterates through that Synset’s
Lemmas (if any). Then we add the synonym to the set named synonyms. We used a set
collection because it automatically eliminates any duplicates we add to it:

In [5]: synonyms = set()

In [6]: for synset in happy.synsets:
   ...:     for lemma in synset.lemmas():
   ...:         synonyms.add(lemma.name())
   ...:

In [7]: synonyms
Out[7]: {'felicitous', 'glad', 'happy', 'well-chosen'}

Getting Antonyms 


If the word represented by a Lemma has antonyms in the WordNet database, invoking the 
Lemma’s antonyms method returns a list of Lemmas representing the antonyms (or an empty 
list if there are no antonyms in the database). In snippet [4] you saw there were four 
Synsets for 'happy'. First, let’s get the Lemmas for the Synset at index 0 of the synsets 
list: 


In [8]: lemmas = happy.synsets[0].lemmas()

In [9]: lemmas
Out[9]: [Lemma('happy.a.01.happy')]





In this case, lemmas returned a list of one Lemma element. We can now check whether the 


database has any corresponding antonyms for that Lemma: 


In [10]: lemmas[0].antonyms()
Out[10]: [Lemma('unhappy.a.01.unhappy')]

The result is a list of Lemmas representing the antonym(s). Here, we see that the one antonym
for 'happy' in the database is 'unhappy'.


11.2.13 Deleting Stop Words 


Stop words are common words in text that are often removed from text before analyzing it
because they typically do not provide useful information. The following table shows NLTK’s
list of English stop words, which is returned by the NLTK stopwords module’s words
function (which we’ll use momentarily):

https://www.nltk.org/book/ch02.html.


NLTK’s English stop words list

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an',
 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been',
 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn',
 "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't",
 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from',
 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven',
 "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself',
 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's",
 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more',
 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor',
 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our',
 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan',
 "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so',
 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs',
 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those',
 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn',
 "wasn't", 'we', 'were', 'weren', "weren't", 'what', 'when', 'where', 'which',
 'while', 'who', 'whom', 'why', 'will', 'with', 'won', "won't", 'wouldn',
 "wouldn't", 'y', 'you', "you'd", "you'll", "you're", "you've", 'your',
 'yours', 'yourself', 'yourselves']

The NLTK library has lists of stop words for several other natural languages as well. Before
using NLTK’s stop-words lists, you must download them, which you do with the nltk
module’s download function:


In [1]: import nltk

In [2]: nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PaulDeitel\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
Out[2]: True


For this example, we'll load the 'english' stop words list. First import stopwords from 
the nltk.corpus module, then use stopwords method words to load the 'english' 


stop words list: 


In [3]: from nltk.corpus import stopwords

In [4]: stops = stopwords.words('english')





Next, let’s create a TextBlob from which we'll remove stop words: 




In [5]: from textblob import TextBlob 
In [6]: blob = TextBlob('Today is a beautiful day.') 





Finally, to remove the stop words, let’s use the TextBlob’s words in a list comprehension
that adds each word to the resulting list only if the word is not in stops:

In [7]: [word for word in blob.words if word not in stops]
Out[7]: ['Today', 'beautiful', 'day']
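The membership test in the list comprehension is case sensitive, and NLTK’s list is all lowercase, so a capitalized stop word at the start of a sentence would slip through. The following is a minimal sketch of a case-insensitive variant. The small STOPS set is a hand-picked subset standing in for the full NLTK list, and the function name remove_stop_words is hypothetical.

```python
# Small hand-picked subset standing in for NLTK's English list (illustrative)
STOPS = {'a', 'is', 'the', 'this'}


def remove_stop_words(words, stops=STOPS):
    """Keep only the words whose lowercase form is not in stops."""
    return [word for word in words if word.lower() not in stops]


kept = remove_stop_words(['This', 'is', 'a', 'beautiful', 'day'])
```

Using a set rather than a list for the stop words also makes each membership test a constant-time lookup, which matters for a corpus the size of a play.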


11.2.14 n-grams 


An n-gram is a sequence of n text items, such as letters in words or words in a sentence. In
natural language processing, n-grams can be used to identify letters or words that frequently
appear adjacent to one another. For text-based user input, this can help predict the next
letter or word a user will type, such as when completing items in IPython with tab-completion
or when entering a message to a friend in your favorite smartphone messaging
app. For speech-to-text, n-grams might be used to improve the quality of the transcription.
N-grams are a form of co-occurrence in which words or letters appear near each other in a
body of text.

https://en.wikipedia.org/wiki/N-gram.








TextBlob’s ngrams method produces a list of WordList n-grams of length three by
default, known as trigrams. You can pass the keyword argument n to produce n-grams of
any desired length. The output shows that the first trigram contains the first three words in
the sentence ('Today', 'is' and 'a'). Then, ngrams creates a trigram starting with the
second word ('is', 'a' and 'beautiful') and so on until it creates a trigram containing
the last three words in the TextBlob:

In [1]: from textblob import TextBlob

In [2]: text = 'Today is a beautiful day. Tomorrow looks like bad weather.'

In [3]: blob = TextBlob(text)

In [4]: blob.ngrams()
Out[4]:
[WordList(['Today', 'is', 'a']),
 WordList(['is', 'a', 'beautiful']),
 WordList(['a', 'beautiful', 'day']),
 WordList(['beautiful', 'day', 'Tomorrow']),
 WordList(['day', 'Tomorrow', 'looks']),
 WordList(['Tomorrow', 'looks', 'like']),
 WordList(['looks', 'like', 'bad']),
 WordList(['like', 'bad', 'weather'])]


The following produces n-grams consisting of five words: 


In [5]: blob.ngrams(n=5)
Out[5]:
[WordList(['Today', 'is', 'a', 'beautiful', 'day']),
 WordList(['is', 'a', 'beautiful', 'day', 'Tomorrow']),
 WordList(['a', 'beautiful', 'day', 'Tomorrow', 'looks']),
 WordList(['beautiful', 'day', 'Tomorrow', 'looks', 'like']),
 WordList(['day', 'Tomorrow', 'looks', 'like', 'bad']),
 WordList(['Tomorrow', 'looks', 'like', 'bad', 'weather'])]
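The sliding-window behavior shown in these outputs is easy to reproduce on a plain list of words. The following is a minimal sketch of the idea, not TextBlob’s implementation. The function here is a standalone helper written for illustration, and the words are split on whitespace rather than tokenized.

```python
def ngrams(words, n=3):
    """Return every run of n consecutive items from words as a list."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]


words = 'Today is a beautiful day Tomorrow looks like bad weather'.split()
trigrams = ngrams(words)
five_grams = ngrams(words, n=5)
```

A sequence of k words yields k - n + 1 n-grams, so the ten words here give eight trigrams and six five-word n-grams. A sequence shorter than n gives an empty list.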


11.3 VISUALIZING WORD FREQUENCIES WITH BAR CHARTS 
AND WORD CLOUDS 


Earlier, we obtained frequencies for a few words in Romeo and Juliet. Sometimes frequency 
visualizations enhance your corpus analyses. There’s often more than one way to visualize 

data, and sometimes one is superior to others. For example, you might be interested in word 
frequencies relative to one another, or you may just be interested in relative uses of words in 


a corpus. In this section, we'll look at two ways to visualize word frequencies: 


• A bar chart that quantitatively visualizes the top 20 words in Romeo and Juliet as bars
representing each word and its frequency.

• A word cloud that qualitatively visualizes more frequently occurring words in bigger
fonts and less frequently occurring words in smaller fonts.


11.3.1 Visualizing Word Frequencies with Pandas 


Let’s visualize Romeo and Juliet’s top 20 words that are not stop words. To do this, we’ll use
features from TextBlob, NLTK and pandas. Pandas visualization capabilities are based on
Matplotlib, so launch IPython with the following command for this session:

ipython --matplotlib


Loading the Data 


First, let’s load Romeo and Juliet. Launch IPython from the ch11 examples folder before
executing the following code so you can access the e-book file RomeoAndJuliet.txt that
you downloaded earlier in the chapter:

In [1]: from pathlib import Path

In [2]: from textblob import TextBlob

In [3]: blob = TextBlob(Path('RomeoAndJuliet.txt').read_text())

Next, load the NLTK stopwords:

In [4]: from nltk.corpus import stopwords

In [5]: stop_words = stopwords.words('english')


Getting the Word Frequencies 


To visualize the top 20 words, we need each word and its frequency. Let’s call the 
blob.word_counts dictionary’s items method to get a list of word-frequency tuples: 


In [6]: items = blob.word_counts.items()


Eliminating the Stop Words 


Next, let’s use a list comprehension to eliminate any tuples containing stop words: 


In [7]: items = [item for item in items if item[0] not in stop_words]

The expression item[0] gets the word from each tuple so we can check whether it’s in
stop_words.


Sorting the Words by Frequency 


To determine the top 20 words, let’s sort the tuples in items in descending order by 


frequency. We can use built-in function sorted with a key argument to sort the tuples by 
the frequency element in each tuple. To specify the tuple element to sort by, use the 


itemgetter function from the Python Standard Library’s operator module: 




In [8]: from operator import itemgetter 


In [9]: sorted_items = sorted(items, key=itemgetter(1), reverse=True) 


As sorted orders items’ elements, it accesses the element at index 1 in each tuple via the
expression itemgetter(1). The reverse=True keyword argument indicates that the
tuples should be sorted in descending order.
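The filter, sort and slice steps of this section can be gathered into one small function. The following is a minimal sketch that works on any dictionary of word counts. The function name top_words and the tiny sample dictionary are made up for illustration and do not come from the play.

```python
from operator import itemgetter


def top_words(word_counts, stop_words, n=20):
    """Return the n most frequent (word, count) tuples, skipping stop words."""
    items = [item for item in word_counts.items() if item[0] not in stop_words]
    return sorted(items, key=itemgetter(1), reverse=True)[:n]


# Tiny made-up counts, used only to show the shape of the result
sample_counts = {'the': 50, 'romeo': 12, 'and': 40, 'juliet': 9, 'love': 7}
top = top_words(sample_counts, stop_words={'the', 'and'}, n=2)
```

Because sorted is stable, words with equal counts keep the order in which they appear in the dictionary.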


Getting the Top 20 Words 





Next, we use a slice to get the top 20 words from sorted_items. When TextBlob
tokenizes a corpus, it splits all contractions at their apostrophes and counts the total number
of apostrophes as one of the “words.” Romeo and Juliet has many contractions. If you display
sorted_items[0], you’ll see that they are the most frequently occurring “word” with 867
of them. We want to display only words, so we ignore element 0 and get a slice containing
elements 1 through 20 of sorted_items:

In some locales this does not happen and element 0 is indeed 'romeo'.

In [10]: top20 = sorted_items[1:21]


Convert top20 to a DataFrame

Next, let’s convert the top20 list of tuples to a pandas DataFrame so we can visualize it
conveniently:

In [11]: import pandas as pd

In [12]: df = pd.DataFrame(top20, columns=['word', 'count'])

In [13]: df
Out[13]:
        word  count
0      romeo    315
1       thou    278
2     juliet    190
3        thy    170
4    capulet    163
5      nurse    149
6       love    148
7       thee    138
8       lady    117
9      shall    110
10     friar    105
11      come     94
12  mercutio     88
13  lawrence     82
14      good     80
15  benvolio     79
16    tybalt     79
17     enter     75
18        go     75
19     night     73




Visualizing the DataFrame 


To visualize the data, we'll use the bar method of the DataFrame’s plot property. The 
arguments indicate which column’s data should be displayed along the x- and y-axes, and 
that we do not want to display a legend on the graph: 


In [14]: axes = df.plot.bar(x='word', y='count', legend=False) 


The bar method creates and displays a Matplotlib bar chart. 


When you look at the initial bar chart that appears, you’ll notice that some of the words are
truncated. To fix that, use Matplotlib’s gcf (get current figure) function to get the Matplotlib
figure that pandas displayed, then call the figure’s tight_layout method. This compresses
the bar chart to ensure all its components fit:

In [15]: import matplotlib.pyplot as plt

In [16]: plt.gcf().tight_layout()

The final graph is shown below:

[Figure: bar chart of the top 20 words in Romeo and Juliet]

11.3.2 Visualizing Word Frequencies with Word Clouds


Next, we’ll build a word cloud that visualizes the top 200 words in Romeo and Juliet. You can
use the open source wordcloud module’s WordCloud class to generate word clouds
with just a few lines of code. By default, wordcloud creates rectangular word clouds, but as
you’ll see, the library can create word clouds with arbitrary shapes.

https://github.com/amueller/word_cloud.


Installing the wordcloud Module 


To install wordcloud, open your Anaconda Prompt (Windows), Terminal (macOS/Linux) or 


shell (Linux) and enter the command: 
conda install -c conda-forge wordcloud 


Windows users might need to run the Anaconda Prompt as an Administrator for proper 
software installation privileges. To do so, right-click Anaconda Prompt in the start menu and 


select More > Run as administrator. 


Loading the Text 


First, let’s load Romeo and Juliet. Launch IPython from the ch11 examples folder before 
executing the following code so you can access the e-book file RomeoAndJuliet.txt you 


downloaded earlier: 


In [1]: from pathlib import Path

In [2]: text = Path('RomeoAndJuliet.txt').read_text()


Loading the Mask Image that Specifies the Word Cloud’s Shape 


To create a word cloud of a given shape, you can initialize a WordCloud object with an image 
known as a mask. The WordCloud fills non-white areas of the mask image with text. We'll 
use a heart shape in this example, provided as mask_heart.png in the ch11 examples 


folder. More complex masks require more time to create the word cloud. 


Let’s load the mask image by using the imread function from the imageio module that 


comes with Anaconda: 


In [3]: import imageio

In [4]: mask_image = imageio.imread('mask_heart.png')

This function returns the image as a NumPy array, which is required by WordCloud.


Configuring the WordCloud Object 


Next, let’s create and configure the WordCloud object: 


In [5]: from wordcloud import WordCloud

In [6]: wordcloud = WordCloud(colormap='prism', mask=mask_image,
   ...:     background_color='white')
   ...:

The default WordCloud width and height in pixels is 400x200, unless you specify width and
height keyword arguments or a mask image. For a mask image, the WordCloud size is the
image’s size. WordCloud uses Matplotlib under the hood. WordCloud assigns random colors
from a color map. You can supply the colormap keyword argument and use one of
Matplotlib’s named color maps. For a list of color map names and their colors, see:

https://matplotlib.org/examples/color/colormaps_reference.html

The mask keyword argument specifies the mask_image we loaded previously. By default, the
word cloud is drawn on a black background, but we customized this with the background_color
keyword argument by specifying a 'white' background. For a complete list of WordCloud’s
keyword arguments, see

http://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html


Generating the Word Cloud 


WordCloud’s generate method receives the text to use in the word cloud as an argument 


and creates the word cloud, which it returns as a WordCloud object: 


In [7]: wordcloud = wordcloud.generate(text)

Before creating the word cloud, generate first removes stop words from the text argument
using the wordcloud module’s built-in stop-words list. Then generate calculates the word
frequencies for the remaining words. The method uses a maximum of 200 words in the word
cloud by default, but you can customize this with the max_words keyword argument.


Saving the Word Cloud as an Image File 


Finally, we use WordCloud’s to_file method to save the word cloud image into the 


specified file: 


In [8]: wordcloud = wordcloud.to_file('RomeoAndJulietHeart.png')

You can now go to the ch11 examples folder and double-click the RomeoAndJulietHeart.png
image file on your system to view it. Your version might have the words in different positions
and different colors.

Generating a Word Cloud from a Dictionary


If you already have a dictionary of key—value pairs representing word counts, you can pass it 
to WordCloud’s fit_words method. This method assumes you've already removed the 


stop words. 


Displaying the Image with Matplotlib 


If you’d like to display the image on the screen, you can use the IPython magic

%matplotlib

to enable interactive Matplotlib support in IPython, then execute the following statements:

import matplotlib.pyplot as plt
plt.imshow(wordcloud)


11.4 READABILITY ASSESSMENT WITH TEXTATISTIC 


An interesting use of natural language processing is assessing text readability, which is 
affected by the vocabulary used, sentence structure, sentence length, topic and more. While 
writing this book, we used the paid tool Grammarly to help tune the writing and ensure the 
text’s readability for a wide audience. 


In this section, we’ll use the Textatistic library to assess readability. There are many
formulas used in natural language processing to calculate readability. Textatistic uses five
popular readability formulas: Flesch Reading Ease, Flesch-Kincaid, Gunning Fog, Simple
Measure of Gobbledygook (SMOG) and Dale-Chall.

https://github.com/erinhengel/Textatistic.

Some other Python readability assessment libraries include readability-score, textstat,
readability and pylinguistics.
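To give a feel for what such formulas compute, here is a minimal sketch of two of them in their commonly published forms, Flesch Reading Ease and the Flesch-Kincaid grade level. Both need only counts of words, sentences and syllables. The function names and the sample counts are assumptions made for illustration. A library’s reported scores depend on how it counts syllables and sentences, so they may differ from a hand calculation.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level: approximates a U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Made-up counts for illustration: 100 words, 10 sentences, 130 syllables
ease = flesch_reading_ease(100, 10, 130)
grade = flesch_kincaid_grade(100, 10, 130)
```

Both formulas penalize long sentences and words with many syllables. They differ only in their weights and in whether the result is an ease score or a grade level.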


Install Textatistic 


To install Textatistic, open your Anaconda Prompt (Windows), Terminal (macOS/Linux) or 
shell (Linux), then execute the following command: 


pip install textatistic 


Windows users might need to run the Anaconda Prompt as an Administrator for proper 
software installation privileges. To do so, right-click Anaconda Prompt in the start menu and 
select More > Run as administrator. 


Calculating Statistics and Readability Scores 


First, let’s load Romeo and Juliet into the text variable: 
In [1]: from pathlib import Path

In [2]: text = Path('RomeoAndJuliet.txt').read_text()


Calculating statistics and readability scores requires a Textatistic object that’s initialized 


with the text you want to assess: 


lick here to view code image 


in sis from textaitrstic import Textatistie 


In [4]: readability = Textatistic (text) 


Textatistic method dict returns a dictionary containing various statistics and the
readability scores 6:

6 Each Project Gutenberg e-book includes additional text, such as their licensing information,
that’s not part of the e-book itself. For this example, we used a text editor to remove that text
from our copy of the e-book.

Each of the values in the dictionary is also accessible via a Textatistic property of the
same name as the keys shown in the preceding output. The statistics produced include:


• char_count—The number of characters in the text.

• word_count—The number of words in the text.

• sent_count—The number of sentences in the text.

• sybl_count—The number of syllables in the text.

• notdalechall_count—A count of the words that are not on the Dale-Chall list, which
is a list of words understood by 80% of 5th graders. 7 The higher this number is compared
to the total word count, the less readable the text is considered to be.





7 http://www.readabilityformulas.com/articles/dale-chall
lity-word-list.php.


• polysyblword_count—The number of words with three or more syllables.

• flesch_score—The Flesch Reading Ease score, which can be mapped to a grade level.
Scores over 90 are considered readable by 5th graders. Scores under 30 require a college
degree. Ranges in between correspond to the other grade levels.

• fleschkincaid_score—The Flesch-Kincaid score, which corresponds to a specific
grade level.

• gunningfog_score—The Gunning Fog index value, which corresponds to a specific
grade level.

• smog_score—The Simple Measure of Gobbledygook (SMOG), which corresponds to the
years of education required to understand text. This measure is considered particularly
effective for healthcare materials. 8

8 https://en.wikipedia.org/wiki/SMOG.

• dalechall_score—The Dale-Chall score, which can be mapped to grade levels from 4
and below to college graduate (grade 16) and above. This score is considered to be most
reliable for a broad range of text types. 9, 10

9 https://en.wikipedia.org/wiki/Readability#The_Dale%E2%80%93Chall_formula.

10 http://www.readabilityformulas.com/articles/how-do-i-decid
hich-readability-formula-to-use.php.


For more details on each of the readability scores produced here and several others, see

https://en.wikipedia.org/wiki/Readability

The Textatistic documentation also shows the readability formulas used:

http://www.erinhengel.com/software/textatistic/
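To make the relationship between the counts and the scores more concrete, here is a minimal
sketch of the widely published Flesch Reading Ease and Flesch-Kincaid grade-level formulas.
The function names and the sample counts are our own illustrative assumptions. Textatistic
may count sentences and syllables differently, so treat the results as approximations rather
than as the library's exact output.

```python
def flesch_reading_ease(word_count, sent_count, sybl_count):
    """Flesch Reading Ease: higher scores indicate easier text."""
    words_per_sentence = word_count / sent_count
    syllables_per_word = sybl_count / word_count
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(word_count, sent_count, sybl_count):
    """Flesch-Kincaid: approximate U.S. school grade level."""
    words_per_sentence = word_count / sent_count
    syllables_per_word = sybl_count / word_count
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical counts for a short passage (not taken from Romeo and Juliet)
ease = flesch_reading_ease(word_count=100, sent_count=5, sybl_count=150)
grade = flesch_kincaid_grade(word_count=100, sent_count=5, sybl_count=150)
print(f'Reading ease: {ease:.3f}; grade level: {grade:.2f}')
```

Both formulas depend only on the average sentence length and the average number of syllables
per word, which is why longer sentences and longer words lower the reading-ease score and
raise the grade level.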


11.5 NAMED ENTITY RECOGNITION WITH SPACY 


NLP can determine what a text is about. A key aspect of this is named entity recognition,
which attempts to locate and categorize items like dates, times, quantities, places, people,
things, organizations and more. In this section, we'll use the named entity recognition
capabilities in the spaCy NLP library 11, 12 to analyze text.

11 https://spacy.io/.

12 You may also want to check out Textacy (https://github.com/chartbeat-labs/textacy),
an NLP library built on spaCy that supports additional NLP tasks.


Install spaCy 


To install spaCy, open your Anaconda Prompt (Windows), Terminal (macOS/Linux) or shell
(Linux), then execute the following command:

conda install -c conda-forge spacy

Windows users might need to run the Anaconda Prompt as an Administrator for proper
software installation privileges. To do so, right-click Anaconda Prompt in the start menu and
select More > Run as administrator.

Once the install completes, you also need to execute the following command, so spaCy can
download additional components it needs for processing English (en) text:

python -m spacy download en


Loading the Language Model 


The first step in using spaCy is to load the language model representing the natural language
of the text you’re analyzing. To do this, you'll call the spacy module’s load function. Let’s
load the English model that we downloaded above:

In [1]: import spacy

In [2]: nlp = spacy.load('en')

The spaCy documentation recommends the variable name nlp.


Creating a spaCy Doc 


Next, you use the nlp object to create a spaCy Doc object 13 representing the document to
process. Here we used a sentence from the introduction to the World Wide Web in many of
our books:

13 https://spacy.io/api/doc.

In [3]: document = nlp('In 1994, Tim Berners-Lee founded the ' +
   ...:     'World Wide Web Consortium (W3C), devoted to ' +
   ...:     'developing web technologies')


Getting the Named Entities 


The Doc object’s ents property returns a tuple of Span objects representing the named
entities found in the Doc. Each Span has many properties. 14 Let’s iterate through the Spans
and display the text and label_ properties:

14 https://spacy.io/api/span.

In [4]: for entity in document.ents:
   ...:     print(f'{entity.text}: {entity.label_}')
   ...:


1994: DATE 
Tim Berners-Lee: PERSON 
the World Wide Web Consortium: ORG 


Each Span’s text property returns the entity as a string, and the label_ property
returns a string indicating the entity’s kind. Here, spaCy found three entities representing a
DATE (1994), a PERSON (Tim Berners-Lee) and an ORG (organization; the World Wide
Web Consortium). For more spaCy information and to take a look at its Quickstart guide,
see

https://spacy.io/usage/models#section-quickstart


11.6 SIMILARITY DETECTION WITH SPACY 


Similarity detection is the process of analyzing documents to determine how alike they
are. One possible similarity detection technique is word frequency counting. For example,
some people believe that the works of William Shakespeare actually might have been written
by Sir Francis Bacon, Christopher Marlowe or others. 15 Comparing the word frequencies of
their works with those of Shakespeare can reveal writing-style similarities.

15 https://en.wikipedia.org/wiki/Shakespeare_authorship_question.
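As a rough illustration of word frequency counting, the following sketch compares two texts
by treating their word counts as vectors and computing the cosine of the angle between them.
This is one common way to compare frequencies, not the technique spaCy uses (spaCy's
similarity is based on word vectors), and the function names and the two short sample
strings are our own assumptions chosen only for demonstration.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter of lowercase word frequencies."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(freq1, freq2):
    """Cosine of the angle between two word-frequency vectors (0.0 to 1.0)."""
    dot = sum(count * freq2[word] for word, count in freq1.items())
    norm1 = math.sqrt(sum(count ** 2 for count in freq1.values()))
    norm2 = math.sqrt(sum(count ** 2 for count in freq2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty text is not similar to anything
    return dot / (norm1 * norm2)


# Short illustrative strings; in practice you'd read entire works from files
sample1 = 'But soft, what light through yonder window breaks?'
sample2 = 'What light is light, if Silvia be not seen?'
score = cosine_similarity(word_frequencies(sample1), word_frequencies(sample2))
print(f'{score:.3f}')
```

With texts this short the score means little. The approach becomes more informative on
entire works, ideally after removing stop words so that common words like "the" and "and"
do not dominate the comparison.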


Various machine-learning techniques we'll discuss in later chapters can be used to study
document similarity. However, as is often the case in Python, there are libraries such as
spaCy and Gensim that can do this for you. Here, we'll use spaCy’s similarity detection
features to compare Doc objects representing Shakespeare’s Romeo and Juliet with
Christopher Marlowe’s Edward the Second. You can download Edward the Second from
Project Gutenberg as we did for Romeo and Juliet earlier in the chapter. 16

16 Each Project Gutenberg e-book includes additional text, such as their licensing information,
that’s not part of the e-book itself. For this example, we used a text editor to remove that text
from our copies of the e-books.


Loading the Language Model and Creating a spaCy Doc 


As in the preceding section, we first load the English model:

In [1]: import spacy

In [2]: nlp = spacy.load('en')





Creating the spaCy Docs 


Next, we create two Doc objects—one for Romeo and Juliet and one for Edward the Second:

In [3]: from pathlib import Path

In [4]: document1 = nlp(Path('RomeoAndJuliet.txt').read_text())

In [5]: document2 = nlp(Path('EdwardTheSecond.txt').read_text())


Comparing the Books’ Similarity 


Finally, we use the Doc class’s similarity method to get a value from 0.0 (not similar) to
1.0 (identical) indicating how similar the documents are:

In [6]: document1.similarity(document2)
Out[6]: 0.93499501L7 91000471


spaCy believes these two documents have significant similarities. For comparison purposes,
we created a Doc representing a current news story and compared it with Romeo and Juliet.
As expected, spaCy returned a low value indicating little similarity between them. Try
copying a current news article into a text file, then performing the similarity comparison
yourself.


11.7 OTHER NLP LIBRARIES AND TOOLS 


We've shown you various NLP libraries, but it’s always a good idea to investigate the range of
options available to you so you can leverage the best tools for your tasks. Below are some
additional mostly free and open source NLP libraries and APIs:

• Gensim—Similarity detection and topic modeling.

• Google Cloud Natural Language API—Cloud-based API for NLP tasks such as named
entity recognition, sentiment analysis, parts-of-speech analysis and visualization,
determining content categories and more.

• Microsoft Linguistic Analysis API.

• Bing sentiment analysis—Microsoft’s Bing search engine now uses sentiment in its search
results. At the time of this writing, sentiment analysis in search results is available only in
the United States.

• PyTorch NLP—Deep learning library for NLP.

• Stanford CoreNLP—A Java NLP library, which also provides a Python wrapper. Includes
coreference resolution, which finds all references to the same thing.

• Apache OpenNLP—Another Java-based NLP library for common tasks, including
coreference resolution. Python wrappers are available.

• PyNLPl (pineapple)—Python NLP library that includes basic and more sophisticated NLP
capabilities.

• SnowNLP—Python library that simplifies Chinese text processing.

• KoNLPy—Korean language NLP.

• stop-words—Python library with stop words for many languages. We used NLTK’s stop
words lists in this chapter.

• TextRazor—A paid cloud-based NLP API that provides a free tier.


11.8 MACHINE LEARNING AND DEEP LEARNING NATURAL 
LANGUAGE APPLICATIONS 


There are many natural language applications that require machine learning and deep
learning techniques. We'll discuss some of the following in our machine learning and deep
learning chapters:

• Answering natural language questions—For example, our publisher, Pearson Education,
has a partnership with IBM Watson that uses Watson as a virtual tutor. Students ask
Watson natural language questions and get answers.

• Summarizing documents—analyzing documents and producing short summaries (also
called abstracts) that can, for example, be included with search results and can help you
decide what to read.

• Speech synthesis (text-to-speech) and speech recognition (speech-to-text)—We use these
in our “IBM Watson” chapter, along with inter-language text-to-text translation, to
develop a near real-time inter-language voice-to-voice translator.

• Collaborative filtering—used to implement recommender systems (“if you liked this
movie, you might also like ...”).

• Text classification—for example, classifying news articles by categories, such as world
news, national news, local news, sports, business, entertainment, etc.

• Topic modeling—finding the topics discussed in documents.

• Sarcasm detection—often used with sentiment analysis.

• Text simplification—making text more concise and easier to read.

• Speech to sign language and vice versa—to enable a conversation with a hearing-impaired
person.

• Lip reader technology—for people who can’t speak, convert lip movement to text or
speech to enable conversation.

• Closed captioning—adding text captions to video.


11.9 NATURAL LANGUAGE DATASETS 


There’s a tremendous number of text data sources available to you for working with natural
language processing:

• Wikipedia—some or all of Wikipedia
(https://meta.wikimedia.org/wiki/Datasets).

• IMDB (Internet Movie Database)—various movie and TV datasets are available.

• UCI’s text datasets—many datasets, including the Spambase dataset.

• Project Gutenberg—50,000+ free e-books that are out-of-copyright in the U.S.

• Jeopardy! dataset—200,000+ questions from the Jeopardy! TV show. A milestone in AI
occurred in 2011 when IBM Watson famously beat two of the world’s best Jeopardy!
players.

• Natural language processing datasets:
https://machinelearningmastery.com/datasets-natural-language-processing/.

• NLTK data: https://www.nltk.org/data.html.

• Sentiment labeled sentences data set (from sources including IMDB.com, amazon.com
and yelp.com).

• Registry of Open Data on AWS—a searchable directory of datasets hosted on Amazon
Web Services (https://registry.opendata.aws).

• Amazon Customer Reviews Dataset—130+ million product reviews
(https://registry.opendata.aws/amazon-reviews/).

• Pitt.edu corpora (http://mpqa.cs.pitt.edu/corpora/).


11.10 WRAP-UP 


In this chapter, you performed a broad range of natural language processing (NLP) tasks 
using several NLP libraries. You saw that NLP is performed on text collections known as 
corpora. We discussed nuances of meaning that make natural language understanding 
difficult. 


We focused on the TextBlob NLP library, which is built on the NLTK and pattern libraries,
but easier to use. You created TextBlobs and tokenized them into Sentences and Words.
You determined the part of speech for each word in a TextBlob, and you extracted noun
phrases.

We demonstrated how to evaluate the positive or negative sentiment of TextBlobs and
Sentences with the TextBlob library’s default sentiment analyzer and with the
NaiveBayesAnalyzer. You used the TextBlob library’s integration with Google Translate
to detect the language of text and perform inter-language translation.

We showed various other NLP tasks, including singularization and pluralization, spell
checking and correction, normalization with stemming and lemmatization, and getting word
frequencies. You obtained word definitions, synonyms and antonyms from WordNet. You
also used NLTK’s stop words list to eliminate stop words from text, and you created n-grams
containing groups of consecutive words.

We showed how to visualize word frequencies quantitatively as a bar chart using pandas’
built-in plotting capabilities. Then, we used the wordcloud library to visualize word
frequencies qualitatively as word clouds. You performed readability assessments using the
Textatistic library. Finally, you used spaCy to locate named entities and to perform similarity
detection among documents. In the next chapter, you'll continue using natural language
processing as we introduce data mining tweets using the Twitter APIs.


12. Data Mining Twitter 


Objectives
In this chapter, you'll:

• Understand Twitter’s impact on businesses, brands, reputation, sentiment analysis,
predictions and more.

• Use Tweepy, one of the most popular Python Twitter API clients for data mining
Twitter.

• Use the Twitter Search API to download past tweets that meet your criteria.

• Use the Twitter Streaming API to sample the stream of live tweets as they’re happening.

• See that the tweet objects returned by Twitter contain valuable information beyond the
tweet text.

• Use the natural language processing techniques from the last chapter to clean and
preprocess tweets to prepare them for analysis.

• Perform sentiment analysis on tweets.

• Spot trends with Twitter’s Trends API.

• Map tweets using folium and OpenStreetMap.

• Understand various ways to store tweets using techniques discussed throughout this
book.


12.1 INTRODUCTION


We're always trying to predict the future. Will it rain on our upcoming picnic? Will the 
stock market or individual securities go up or down, and when and by how much? How 
will people vote in the next election? What’s the chance that a new oil exploration venture 
will strike oil and if so how much would it likely produce? Will a baseball team win more 
games if it changes its batting philosophy to “swing for the fences?” How much customer 
traffic does an airline anticipate over the next many months? And hence how should the 
company buy oil commodity futures to guarantee that it will have the supply it needs and 
hopefully at a minimal cost? What track is a hurricane likely to take and how powerful will 
it likely become (category 1, 2, 3, 4 or 5)? That kind of information is crucial to emergency 
preparedness efforts. Is a financial transaction likely to be fraudulent? Will a mortgage 


default? Is a disease likely to spread rapidly and, if so, to what geographic area? 


Prediction is a challenging and often costly process, but the potential rewards are great. 
With the technologies in this and the upcoming chapters, we'll see how AI, often in 


concert with big data, is rapidly improving prediction capabilities. 


In this chapter we concentrate on data mining Twitter, looking for the sentiment in 
tweets. Data mining is the process of searching through large collections of data, often 
big data, to find insights that can be valuable to individuals and organizations. The 
sentiment that you data mine from tweets could help predict the results of an election, the 
revenues a new movie is likely to generate and the success of a company’s marketing 
campaign. It could also help companies spot weaknesses in competitors’ product 


offerings. 


You'll connect to Twitter via web services. You'll use Twitter’s Search API to tap into the
enormous base of past tweets. You'll use Twitter’s Streaming API to sample the flood of
new tweets as they happen. With the Twitter Trends API, you'll see what topics are
trending. You'll find that much of what we presented in the “Natural Language Processing
(NLP)” chapter will be useful in building Twitter applications.

As you’ve seen throughout this book, because of powerful libraries, you'll often perform
significant tasks with just a few lines of code. This is why Python and its robust open-source
community are appealing.


Twitter has displaced the major news organizations as the first source for newsworthy 
events. Most Twitter posts are public and happen in real-time as events unfold globally. 
People speak frankly about any subject and tweet about their personal and business lives. 
They comment on the social, entertainment and political scenes and whatever else comes 
to mind. With their mobile phones, they take and post photos and videos of events as they 
happen. You'll commonly hear the terms Twitterverse and Twittersphere to mean the 
hundreds of millions of users who have anything to do with sending, receiving and 


analyzing tweets. 


What Is Twitter? 


Twitter was founded in 2006 as a microblogging company and today is one of the most 
popular sites on the Internet. Its concept is simple. People write short messages called 
tweets, initially limited to 140 characters but recently increased for most languages to 280 
characters. Anyone can generally choose to follow anyone else. This is different from the 
closed, tight communities on other social media platforms such as Facebook, LinkedIn 


and many others, where the “following relationships” must be reciprocal. 


Twitter Statistics 


Twitter has hundreds of millions of users and hundreds of millions of tweets are sent
every day with many thousands sent per second. 1 Searching online for “Internet
statistics” and “Twitter statistics” will help you put these numbers in perspective. Some
“tweeters” have more than 100 million followers. Dedicated tweeters generally post
several tweets per day to keep their followers engaged. Tweeters with the largest followings
are typically entertainers and politicians. Developers can tap into the live stream of tweets as
they’re happening. This has been likened to “drinking from a fire hose,” because the
tweets come at you so quickly.

1 http://www.internetlivestats.com/twitter-statistics/.


Twitter and Big Data 


Twitter has become a favorite big data source for researchers and business people
worldwide. Twitter allows regular users free access to a small portion of the more recent
tweets. Through special arrangements with Twitter, some third-party businesses (and
Twitter itself) offer paid access to much larger portions of the all-time tweets database.


Cautions 


You can’t always trust everything you read on the Internet, and tweets are no exception.
For example, people might use false information to try to manipulate financial markets or
influence political elections. Hedge funds often trade securities based in part on the tweet
streams they follow, but they’re cautious. That’s one of the challenges of building
business-critical or mission-critical systems based on social media content.


Going forward, we use web services extensively. Internet connections can be lost, services 
can change and some services are not available in all countries. This is the real world of 
cloud-based programming. We cannot program with the same reliability as desktop apps 


when using web services. 


12.2 OVERVIEW OF THE TWITTER APIS 


Twitter’s APIs are cloud-based web services, so an Internet connection is required to
execute the code in this chapter. Web services are methods that you call in the cloud, as
you’ll do with the Twitter APIs in this chapter, the IBM Watson APIs in the next chapter
and other APIs you'll use as computing becomes more cloud-based. Each API method has
a web service endpoint, which is represented by a URL that’s used to invoke that method
over the Internet.


Twitter’s APIs include many categories of functionality, some free and some paid. Most 
have rate limits that restrict the number of times you can use them in 15-minute 
intervals. In this chapter, you'll use the Tweepy library to invoke methods from the 


following Twitter APIs: 


• Authentication API—Pass your Twitter credentials (discussed shortly) to Twitter so
you can use the other APIs.

• Accounts and Users API—Access information about an account.

• Tweets API—Search through past tweets, access tweet streams to tap into tweets
happening now and more.

• Trends API—Find locations of trending topics and get lists of trending topics by
location.

See the extensive list of Twitter API categories, subcategories and individual methods at:

https://developer.twitter.com/en/docs/api-reference-index.html


Rate Limits: A Word of Caution 


Twitter expects developers to use its services responsibly. Each Twitter API method has a
rate limit, which is the maximum number of requests (that is, calls) you can make
during a 15-minute window. Twitter may block you from using its APIs if you continue to
call a given API method after that method’s rate limit has been reached.

Before using any API method, read its documentation and understand its rate limits. 2
We'll configure Tweepy to wait when it encounters rate limits. This helps prevent you
from exceeding the rate-limit restrictions. Some methods list both user rate limits and app
rate limits. All of this chapter’s examples use app rate limits. User rate limits are for apps
that enable individual users to log into Twitter, like third-party apps that interact with
Twitter on your behalf, such as smartphone apps from other vendors.

2 Keep in mind that Twitter could change these limits in the future.

For details on rate limiting, see

https://developer.twitter.com/en/docs/basics/rate-limiting

For specific rate limits on individual API methods, see

https://developer.twitter.com/en/docs/basics/rate-limits

and each API method’s documentation.
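To see what respecting a rate limit involves, consider the following minimal sketch of a
client-side limiter that tracks recent calls and reports how long to wait before the next one.
This is not how Twitter or Tweepy implements rate limiting. The class name, the limit of two
calls and the fake clock are hypothetical choices made only so the example runs instantly.

```python
import time
from collections import deque


class WindowRateLimiter:
    """Allow at most max_calls within any window_seconds-long interval."""

    def __init__(self, max_calls, window_seconds=15 * 60, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self.calls = deque()  # timestamps of recent calls, oldest first

    def seconds_until_allowed(self):
        """Return 0.0 if a call may proceed now, else the wait in seconds."""
        now = self.clock()
        # discard timestamps that have aged out of the window
        while self.calls and now - self.calls[0] >= self.window_seconds:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        return self.window_seconds - (now - self.calls[0])

    def record_call(self):
        """Note that a request was just made."""
        self.calls.append(self.clock())


# Hypothetical limit of 2 requests per window, with a fake clock for illustration
fake_now = [0.0]
limiter = WindowRateLimiter(max_calls=2, clock=lambda: fake_now[0])
limiter.record_call()
limiter.record_call()
wait = limiter.seconds_until_allowed()  # window is full, so wait > 0
fake_now[0] = 15 * 60  # pretend 15 minutes have passed
wait_after = limiter.seconds_until_allowed()  # window has cleared
print(wait, wait_after)
```

In a real program you'd omit the clock argument and call time.sleep with the returned
wait before making the next request.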


Other Restrictions 


Twitter is a goldmine for data mining and they allow you to do a lot with their free 
services. You'll be amazed at the valuable applications you can build and how these will 
help you improve your personal and career endeavors. However, if you do not follow 
Twitter’s rules and regulations, your developer account could be terminated. 


You should carefully read the following and the documents they link to:

• Terms of Service: https://twitter.com/tos

• Developer Agreement:
https://developer.twitter.com/en/developer-terms/agreement-and-policy.html

• Developer Policy:
https://developer.twitter.com/en/developer-terms/policy.html

• Other restrictions:
https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases

You'll see later in this chapter that you can search tweets only for the last seven days and
get only a limited number of tweets using the free Twitter APIs. Some books and articles
say you can get around those limits by scraping tweets directly from twitter.com.
However, the Terms of Service explicitly say that “scraping the Services without the
prior consent of Twitter is expressly prohibited.”


12.3 CREATING A TWITTER ACCOUNT 


Twitter requires you to apply for a developer account to be able to use their APIs. Go to

https://developer.twitter.com/en/apply-for-access

and submit your application. If you do not already have a Twitter account, you'll have to
register for one as part of this process. You'll be asked questions about the purpose of your
account. You must carefully read and agree to Twitter’s terms to complete the application,
then confirm your email address.


Twitter reviews every application. Approval is not guaranteed. At the time of this writing, 
personal-use accounts were approved immediately. For company accounts, the process 


was taking from a few days to several weeks, according to the Twitter developer forums. 


12.4 GETTING TWITTER CREDENTIALS—CREATING AN 
APP 


Once you have a Twitter developer account, you must obtain credentials for interacting 
with the Twitter APIs. To do so, you'll create an app. Each app has separate credentials. 


To create an app, log into

https://developer.twitter.com

and perform the following steps:

1. At the top-right of the page, click the drop-down menu for your account and select
Apps.

2. Click Create an app.

3. In the App name field, specify your app’s name. If you send tweets via the API, this
app name will be the tweets’ sender. It also will be shown to users if you create
applications that require a user to log in via Twitter. We will not do either in this
chapter, so a name like "YourName Test App" is fine for use with this chapter.

4. In the Application description field, enter a description for your app. When
creating Twitter-based apps that will be used by other people, this would describe
what your app does. For this chapter, you can use "Learning to use the
Twitter API."

5. In the Website URL field, enter your website. When creating Twitter-based apps,
this is supposed to be the website where you host your app. You can use your Twitter
URL: https://twitter.com/YourUserName, where YourUserName is your
Twitter account screen name. For example, https://twitter.com/nasa
corresponds to the NASA screen name @nasa.

6. The Tell us how this app will be used field is a description of at least 100
characters that helps Twitter employees understand what your app does. We entered
"I am new to Twitter app development and am simply learning how
to use the Twitter APIs for educational purposes."

7. Leave the remaining fields empty and click Create, then carefully review the (lengthy)
developer terms and click Create again.


Getting Your Credentials 


After you complete Step 7 above, Twitter displays a web page for managing your app. At 
the top of the page are App details, Keys and tokens and Permissions tabs. Click the 
Keys and tokens tab to view your app’s credentials. Initially, the page shows the 
Consumer API keys—the API Key and the API secret Key. Click Create to get an 
access token and access token secret. All four of these will be used to authenticate 


with Twitter so that you may invoke its APIs. 


Storing Your Credentials 


As a good practice, do not include your API keys and access tokens (or any other
credentials, like usernames and passwords) directly in your source code, as that would
expose them to anyone reading the code. You should store your keys in a separate file and
never share that file with anyone. 3

3 Good practice would be to use an encryption library such as bcrypt
(https://github.com/pyca/bcrypt/) to encrypt your keys, access tokens or any
other credentials you use in your code, then read them in and decrypt them only as you
pass them to Twitter.

The code you'll execute in subsequent sections assumes that you place your consumer key,
consumer secret, access token and access token secret values into the file keys.py shown
below. You can find this file in the ch12 examples folder:

consumer_key='YourConsumerKey'
consumer_secret='YourConsumerSecret'
access_token='YourAccessToken'
access_token_secret='YourAccessTokenSecret'


Edit this file, replacing YourConsumerKey, YourConsumerSecret, 
YourAccessToken and YourAccessTokenSecret with your consumer key, consumer 


secret, access token and access token secret values. Then, save the file. 
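Because it’s easy to forget to replace one of the placeholders, you might check the values
before trying to authenticate. The following is a minimal sketch of such a check. The function
name is hypothetical, the test for a leading 'Your' is only a heuristic based on the
placeholder names above, and a SimpleNamespace stands in for the imported keys module so
the example is self-contained.

```python
from types import SimpleNamespace

REQUIRED = ('consumer_key', 'consumer_secret',
            'access_token', 'access_token_secret')


def missing_credentials(keys_module):
    """Return the names of credentials that are absent or still placeholders."""
    missing = []
    for name in REQUIRED:
        value = getattr(keys_module, name, '')
        if not value or value.startswith('Your'):
            missing.append(name)
    return missing


# Stand-in for `import keys`; a real script would import keys.py instead
keys = SimpleNamespace(
    consumer_key='abc123',              # made-up value for illustration
    consumer_secret='def456',           # made-up value for illustration
    access_token='YourAccessToken',     # placeholder left unedited
    access_token_secret='ghi789')       # made-up value for illustration

print(missing_credentials(keys))
```

An empty list indicates that all four values appear to have been filled in.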


OAuth 2.0 


The consumer key, consumer secret, access token and access token secret are each part of
the OAuth 2.0 authentication process 4, 5—sometimes called the OAuth dance—that
Twitter uses to enable access to its APIs. The Tweepy library enables you to provide the
consumer key, consumer secret, access token and access token secret and handles the
OAuth 2.0 authentication details for you.

4 https://developer.twitter.com/en/docs/basics/authentication/overview.

5 https://oauth.net/.


12.5 WHAT’S IN A TWEET? 


The Twitter API methods return JSON objects. JSON (JavaScript Object Notation) is 
a text-based data-interchange format used to represent objects as collections of name- 
value pairs. It’s commonly used when invoking web services. JSON is both a human- 
readable and computer-readable format that makes data easy to send and receive across 


the Internet. 


JSON objects are similar to Python dictionaries. Each JSON object contains a list of
property names and values, in the following curly braced format:

{propertyName1: value1, propertyName2: value2}

As in Python, JSON lists are comma-separated values in square brackets:

[value1, value2, value3]

For your convenience, Tweepy handles the JSON for you behind the scenes, converting
JSON to Python objects using classes defined in the Tweepy library.
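Although Tweepy does the conversion for you, it can help to see how JSON maps to Python
objects. The following sketch uses the Python Standard Library’s json module on a small
hand-written JSON fragment. The fragment is our own illustration with a few tweet-like
property names, not a complete response from Twitter.

```python
import json

# A hand-written JSON fragment with a few tweet-like properties
tweet_json = '''
{
    "id": 1037404890354606082,
    "id_str": "1037404890354606082",
    "truncated": true,
    "coordinates": null,
    "entities": {"hashtags": [{"text": "SolarProbe"}]}
}
'''

tweet = json.loads(tweet_json)  # JSON object -> dict, JSON array -> list

print(type(tweet).__name__)
print(tweet['truncated'], tweet['coordinates'])
print([tag['text'] for tag in tweet['entities']['hashtags']])
```

Note that JSON’s true and null become Python’s True and None, and that nested JSON
objects and lists become nested dictionaries and lists.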


Key Properties of a Tweet Object 


A tweet (also called a status update) may contain a maximum of 280 characters, but the
tweet objects returned by the Twitter APIs contain many metadata attributes that
describe aspects of the tweet, such as:

• when it was created,

• who created it,

• lists of the hashtags, urls, @-mentions and media (such as images and videos, which
are specified via their URLs) included in the tweet,

• and more.


The following table lists a few key attributes of a tweet object: 


Attribute         Description

created_at        The creation date and time in UTC (Coordinated Universal Time)
                  format.

entities          Twitter extracts hashtags, urls, user mentions (that is,
                  @username mentions), media (such as images and videos),
                  symbols and polls from tweets and places them into the
                  entities dictionary as lists that you can access with these keys.

extended_tweet    For tweets over 140 characters, contains details such as the
                  tweet's full text and entities.

favorite_count    Number of times other users favorited the tweet.

coordinates       The coordinates (latitude and longitude) from which the tweet was
                  sent. This is often null (None in Python) because many users
                  disable sending location data.

place             Users can associate a place with a tweet. If they do, this will be a
                  place object:
                  https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/geo-objects#place-dictionary
                  otherwise, it'll be null (None in Python).

id                The integer ID of the tweet. Twitter recommends using id_str for
                  portability.

id_str            The string representation of the tweet's integer ID.

lang              Language of the tweet, such as 'en' for English or 'fr' for
                  French.

retweet_count     Number of times other users retweeted the tweet.

text              The text of the tweet. If the tweet uses the new 280-character limit
                  and contains more than 140 characters, this property will be
                  truncated and the truncated property will be set to true. This
                  might also occur if a 140-character tweet was retweeted and became
                  more than 140 characters as a result.

user              The User object representing the user that posted the tweet. For the
                  User object JSON properties, see:
                  https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object.
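
The relationship between text, truncated and extended_tweet can be sketched with plain dictionaries. This is an illustration under assumptions, not Tweepy code: the helper name full_text is hypothetical, the sample tweets are made up, and the sketch assumes the extended_tweet dictionary stores the complete text under a 'full_text' key.

```python
def full_text(tweet):
    """Return the complete text of a tweet given as a dict.

    Assumes a truncated tweet carries an 'extended_tweet' dict whose
    'full_text' key holds the untruncated text.
    """
    if tweet.get('truncated') and 'extended_tweet' in tweet:
        return tweet['extended_tweet']['full_text']
    return tweet['text']  # short tweets need no special handling


# Made-up sample data for illustration
short_tweet = {'text': 'Launch day!', 'truncated': False}
long_tweet = {
    'text': 'A long announcement that was cut off...',
    'truncated': True,
    'extended_tweet': {'full_text': 'A long announcement that was cut off '
                                    'by the original 140-character limit.'},
}

print(full_text(short_tweet))
print(full_text(long_tweet))
```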








Sample Tweet JSON 


Let’s look at sample JSON for the following tweet from the @nasa account: 


@NoFear1075 Great question, Anthony! Throughout its seven-year mission,
our Parker #SolarProbe spacecraft... https://t.co/xKd6ym8waT


We added line numbers and reformatted some of the JSON due to wrapping. Note that 
some fields in Tweet JSON are not supported in every Twitter API method; such 


differences are explained in the online documentation for each method. 


 1 {'created_at': 'Wed Sep 05 18:19:34 +0000 2018',
 2  'id': 1037404890354606082,
 3  'id_str': '1037404890354606082',
 4  'text': '@NoFear1075 Great question, Anthony! Throughout its seven-year
      mission, our Parker #SolarProbe spacecraft... https://t.co/xKd6ym8waT',
 5  'truncated': True,
 6  'entities': {'hashtags': [{'text': 'SolarProbe', 'indices': [84, 95]}],
 7     'symbols': [],
 8     'user_mentions': [{'screen_name': 'NoFear1075',
 9         'name': 'Anthony Perrone',
10         'id': 284339791,
11         'id_str': '284339791',
12         'indices': [0, 11]}],
13     'urls': [{'url': 'https://t.co/xKd6ym8waT',
14         'expanded_url': 'https://twitter.com/i/web/status/
              1037404890354606082',
15         'display_url': 'twitter.com/i/web/status/1...',
16         'indices': [117, 140]}]},
17  'source': '<a href="http://twitter.com" rel="nofollow">Twitter Web
      Client</a>',
18  'in_reply_to_status_id': 1037390542424956928,
19  'in_reply_to_status_id_str': '1037390542424956928',
20  'in_reply_to_user_id': 284339791,
21  'in_reply_to_user_id_str': '284339791',
22  'in_reply_to_screen_name': 'NoFear1075',
23  'user': {'id': 11348282,
24     'id_str': '11348282',
25     'name': 'NASA',
26     'screen_name': 'NASA',
27     'location': '',
28     'description': 'Explore the universe and discover our home planet with
         @NASA. We usually post in EST (UTC-5)',
29     'url': 'https://t.co/TCEE6NS8nD',
30     'entities': {'url': {'urls': [{'url': 'https://t.co/TCEE6NS8nD',
31         'expanded_url': 'http://www.nasa.gov',
32         'display_url': 'nasa.gov',
33         'indices': [0, 23]}]},
34     'description': {'urls': []}},
35     'protected': False,
36     'followers_count': 29486081,
37     'friends_count': 287,
38     'listed_count': 91928,
39     'created_at': 'Wed Dec 19 20:20:32 +0000 2007',
40     'favourites_count': 3963,
41     'time_zone': None,
42     'geo_enabled': False,
43     'verified': True,
44     'statuses_count': 53147,
45     'lang': 'en',
46     'contributors_enabled': False,
47     'is_translator': False,
48     'is_translation_enabled': False,
49     'profile_background_color': '000000',
50     'profile_background_image_url': 'http://abs.twimg.com/images/themes/
         theme1/bg.png',
51     'profile_background_image_url_https': 'https://abs.twimg.com/images/
         themes/theme1/bg.png',
52     'profile_image_url': 'http://pbs.twimg.com/profile_images/188302352/
         nasalogo_twitter_normal.jpg',
53     'profile_image_url_https': 'https://pbs.twimg.com/profile_images/
         188302352/nasalogo_twitter_normal.jpg',
54     'profile_banner_url': 'https://pbs.twimg.com/profile_banners/11348282/
         1535145490',
55     'profile_link_color': '205BA7',
56     'profile_sidebar_border_color': '000000',
57     'profile_sidebar_fill_color': 'F3F2F2',
58     'profile_text_color': '000000',
59     'profile_use_background_image': True,
60     'has_extended_profile': True,
61     'default_profile': False,
62     'default_profile_image': False,
63     'following': True,
64     'follow_request_sent': False,
65     'notifications': False,
66     'translator_type': 'regular'},
67  'geo': None,
68  'coordinates': None,
69  'place': None,
70  'contributors': None,
71  'is_quote_status': False,
72  'retweet_count': ...,
73  'favorite_count': ...,
74  'favorited': False,
75  'retweeted': False,
76  'possibly_sensitive': False,
77  'lang': 'en'}





Twitter JSON Object Resources 


For a complete, more readable list of the tweet object attributes, see:

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

For additional details that were added when Twitter moved from a limit of 140 to 280
characters per tweet, see

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json.html#extendedtweet

For a general overview of all the JSON objects that Twitter APIs return, and links to the
specific object details, see

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json


12.6 TWEEPY 


We'll use the Tweepy library [6] (http://www.tweepy.org/), one of the most popular
Python libraries for interacting with the Twitter APIs. Tweepy makes it easy to access
Twitter's capabilities and hides from you the details of processing the JSON objects
returned by the Twitter APIs. You can view Tweepy's documentation [7] at

http://docs.tweepy.org/en/latest/

For additional information and the Tweepy source code, visit

https://github.com/tweepy/tweepy

[6] Other Python libraries recommended by Twitter include Birdy, python-twitter, Python
Twitter Tools, TweetPony, TwitterAPI, twitter-gobject, TwitterSearch and twython. See
https://developer.twitter.com/en/docs/developer-utilities/twitter-libraries.html for details.

[7] The Tweepy documentation is a work in progress. At the time of this writing, Tweepy
does not have documentation for their classes corresponding to the JSON objects the
Twitter APIs return. Tweepy's classes use the same attribute names and structure as the
JSON objects. You can determine the correct attribute names to access by looking at
Twitter's JSON documentation. We'll explain any attribute we use in our code and provide
footnotes with links to the Twitter JSON descriptions.


Installing Tweepy 


To install Tweepy, open your Anaconda Prompt (Windows), Terminal (macOS/Linux) or 


shell (Linux), then execute the following command: 


pip install tweepy==3.7 


Windows users might need to run the Anaconda Prompt as an Administrator for proper 
software installation privileges. To do so, right-click Anaconda Prompt in the start menu 


and select More > Run as administrator. 


Installing geopy 


As you work with Tweepy, you'll also use functions from our tweetutilities.py file 
(provided with this chapter’s example code). One of the utility functions in that file 
depends on the geopy library (https://github.com/geopy/geopy), which we'll
discuss in Section 12.15. To install geopy, execute:

conda install -c conda-forge geopy


12.7 AUTHENTICATING WITH TWITTER VIA TWEEPY 


In the next several sections, you'll invoke various cloud-based Twitter APIs via Tweepy. 
Here you'll begin by using Tweepy to authenticate with Twitter and create a Tweepy API 


object, which is your gateway to using the Twitter APIs over the Internet. In subsequent 





sections, you'll work with various Twitter APIs by invoking methods on your API object.


Before you can invoke any Twitter API, you must use your API key, API secret key, access
token and access token secret to authenticate with Twitter. [8] Launch IPython from the
ch12 examples folder, then import the tweepy module and the keys.py file that you
modified earlier in this chapter. You can import any .py file as a module by using the
file's name without the .py extension in an import statement:

[8] You may wish to create apps that enable users to log into their Twitter accounts,
manage them, post tweets, read tweets from other users, search for tweets, etc. For details
on user authentication see the Tweepy Authentication tutorial at
http://docs.tweepy.org/en/latest/auth_tutorial.html.

In [1]: import tweepy

In [2]: import keys

When you import keys.py as a module, you can individually access each of the four
variables defined in that file as keys.variable_name.
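
Before authenticating, it can help to confirm that all four variables are actually filled in. The following is a minimal sketch, not part of Tweepy: the helper missing_credentials is hypothetical, a SimpleNamespace stands in for the imported keys module, and the name consumer_secret is assumed by analogy with consumer_key.

```python
from types import SimpleNamespace

# Variable names this section reads from keys.py. consumer_key and
# consumer_secret (an assumed name) hold the API key and API secret key.
REQUIRED_NAMES = ('consumer_key', 'consumer_secret',
                  'access_token', 'access_token_secret')


def missing_credentials(keys_module):
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_NAMES
            if not getattr(keys_module, name, '')]


# Stand-in for `import keys`; the values are placeholders, not real credentials
keys = SimpleNamespace(consumer_key='my-api-key', consumer_secret='',
                       access_token='my-token',
                       access_token_secret='my-token-secret')

print(missing_credentials(keys))  # lists the names still to be filled in
```

An empty list means every credential has a value, though not that Twitter will accept it.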


Creating and Configuring an OAuthHandler to Authenticate with Twitter

Authenticating with Twitter via Tweepy involves two steps. First, create an object of the
tweepy module's OAuthHandler class, passing your API key and API secret key to its
constructor. A constructor is a function that has the same name as the class (in this
case, OAuthHandler) and receives the arguments used to configure the new object:


In [3]: auth = tweepy.OAuthHandler(keys.consumer_key,
   ...:                            keys.consumer_secret)


Specify your access token and access token secret by calling the OAuthHandler object’s 


set_access_token method: 


In [4]: auth.set_access_token(keys.access_token,
   ...:                       keys.access_token_secret)


Creating an API Object 


Now, create the API object that you'll use to interact with Twitter: 


In [5]: api = tweepy.API(auth, wait_on_rate_limit=True,
   ...:                  wait_on_rate_limit_notify=True)

We specified three arguments in this call to the API constructor:

• auth is the OAuthHandler object containing your credentials.

• The keyword argument wait_on_rate_limit=True tells Tweepy to wait 15
minutes each time it reaches a given API method's rate limit. This ensures that you do
not violate Twitter's rate-limit restrictions.

• The keyword argument wait_on_rate_limit_notify=True tells Tweepy that, if it
needs to wait due to rate limits, it should notify you by displaying a message at the
command line.


You're now ready to interact with Twitter via Tweepy. Note that the code examples in the
next several sections are presented as a continuing IPython session, so the authorization
process you went through here need not be repeated.


12.8 GETTING INFORMATION ABOUT A TWITTER 
ACCOUNT 


After authenticating with Twitter, you can use the Tweepy API object's get_user
method to get a tweepy.models.User object containing information about a user's
Twitter account. Let's get a User object for NASA's @nasa Twitter account:

In [6]: nasa = api.get_user('nasa')

The get_user method calls the Twitter API's users/show method. [9] Each Twitter
method you call through Tweepy has a rate limit. You can call Twitter's users/show
method up to 900 times every 15 minutes to get information on specific user accounts. As
we mention other Twitter API methods, we'll provide a footnote with a link to each
method's documentation in which you can view its rate limits.

[9] https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-show.





The tweepy.models classes each correspond to the JSON that Twitter returns. For
example, the User class corresponds to a Twitter user object:

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object

Each tweepy.models class has a method that reads the JSON and turns it into an object
of the corresponding Tweepy class.


Getting Basic Account Information 


Let’s access some User object properties to display information about the @nasa account: 


• The id property is the account ID number created by Twitter when the user joined
Twitter.

• The name property is the name associated with the user's account.

• The screen_name property is the user's Twitter handle (@nasa). Both the name
and screen_name could be created names to protect a user's privacy.

• The description property is the description from the user's profile.

In [7]: nasa.id
Out[7]: 11348282

In [8]: nasa.name
Out[8]: 'NASA'

In [9]: nasa.screen_name
Out[9]: 'NASA'

In [10]: nasa.description
Out[10]: 'Explore the universe and discover our home planet with @NASA. We usually post in EST (UTC-5)'








Getting the Most Recent Status Update

The User object's status property returns a tweepy.models.Status object, which
corresponds to a Twitter tweet object:

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

The Status object's text property contains the text of the account's most recent tweet:

In [11]: nasa.status.text
Out[11]: 'The interaction of a high-velocity young star with the cloud of

The text property was originally for tweets up to 140 characters. The ... above indicates
that the tweet text was truncated. When Twitter increased the limit to 280 characters,
they added an extended_tweet property (demonstrated later) for accessing the text
and other information from tweets between 141 and 280 characters. In this case, Twitter
sets text to a truncated version of the extended_tweet's text. Also, retweeting often
results in truncation because a retweet adds characters that could exceed the character
limit.

Getting the Number of Followers 


You can view an account's number of followers with the followers_count property:

In [12]: nasa.followers_count
Out[12]: 29453541

Though this number is large, there are accounts with over 100 million followers. [10]

[10] https://friendorfollow.com/twitter/most-followers/.


Getting the Number of Friends 


Similarly, you can view an account's number of friends (that is, the number of accounts an
account follows) with the friends_count property:

In [13]: nasa.friends_count
Out[13]: 287


Getting Your Own Account’s Information 


You can use the properties in this section on your own account as well. To do so, call the 


Tweepy API object’s me method, as in: 
me = api.me() 


This returns a User object for the account you used to authenticate with Twitter in the 


preceding section. 


12.9 INTRODUCTION TO TWEEPY CURSORS: GETTING AN 
ACCOUNT’S FOLLOWERS AND FRIENDS 


When invoking Twitter API methods, you often receive as results collections of objects, 
such as tweets in your Twitter timeline, tweets in another account’s timeline or lists of 


tweets that match specified search criteria. A timeline consists of tweets sent by that user 


and by that user’s friends—that is, other accounts that the user follows. 


Each Twitter API method’s documentation discusses the maximum number of items the 
method can return in one call—this is known as a page of results. When you request more 
results than a given method can return, Twitter’s JSON responses indicate that there are 
more pages to get. Tweepy's Cursor class handles these details for you. A Cursor
invokes a specified method and checks whether Twitter indicated that there is another 
page of results. If so, the Cursor automatically calls the method again to get those results. 
This continues, subject to the method’s rate limits, until there are no more results to 
process. If you configure the API object to wait when rate limits are reached (as we did), 
the Cursor will adhere to the rate limits and wait as needed between calls. The following 


subsections discuss Cursor fundamentals. For more details, see the Cursor tutorial at:

http://docs.tweepy.org/en/latest/cursor_tutorial.html
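
The paging idea can be sketched without any network access. This is a simplified illustration, not Tweepy's implementation: fetch_page is a hypothetical stand-in for a paged API call, the data is fake, and the convention that a next-cursor value of 0 means "no more pages" is an assumption of this sketch.

```python
def fetch_page(cursor, page_size=3):
    """Hypothetical stand-in for one paged API call.

    Returns (items, next_cursor); next_cursor is 0 when no pages remain.
    """
    data = list(range(1, 9))  # pretend these are 8 follower IDs
    page = data[cursor:cursor + page_size]
    next_cursor = cursor + page_size
    if next_cursor >= len(data):
        next_cursor = 0  # signal that this was the last page
    return page, next_cursor


def items(limit=None):
    """Yield results one at a time, requesting pages only as needed."""
    cursor, yielded = 0, 0
    while True:
        page, cursor = fetch_page(cursor)
        for item in page:
            if limit is not None and yielded >= limit:
                return  # the caller's requested number has been reached
            yield item
            yielded += 1
        if cursor == 0:  # no more pages to request
            return


print(list(items(5)))  # stops partway through the second page
print(list(items()))   # keeps paging until the results run out
```

The caller simply iterates over results and never has to track pages or cursor values.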


12.9.1 Determining an Account’s Followers 





Let's use a Tweepy Cursor to invoke the API object's followers method, which calls
the Twitter API's followers/list method [11] to obtain an account's followers. Twitter
returns these in groups of 20 by default, but you can request up to 200 at a time. For
demonstration purposes, we'll grab 10 of NASA's followers.

[11] https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-followers-list.


Method followers returns tweepy.models.User objects containing information
about each follower. Let's begin by creating a list in which we'll store the followers'
screen names:


In [14]: followers = [] 


Creating a Cursor 


Next, let's create a Cursor object that will call the followers method for NASA's
account, which is specified with the screen_name keyword argument:

In [15]: cursor = tweepy.Cursor(api.followers, screen_name='nasa')

The Cursor's constructor receives as its argument the name of the method to call.
Here, api.followers indicates that the Cursor will call the api object's followers
method. If the Cursor constructor receives any additional keyword arguments, like
screen_name, these will be passed to the method specified in the constructor's first
argument. So, this Cursor specifically gets followers for the @nasa Twitter account.


Getting Results 


Now, we can use the Cursor to get some followers. The following for statement iterates
through the results of the expression cursor.items(10). The Cursor's items
method initiates the call to api.followers and returns the followers method's
results. In this case, we pass 10 to the items method to request only 10 results:

In [16]: for account in cursor.items(10):
    ...:     followers.append(account.screen_name)
    ...:

In [17]: print('Followers:',
    ...:       ' '.join(sorted(followers, key=lambda s: s.lower())))
    ...:
Followers: abhinavborra BHood1976 Eshwar12341 Harish90469614 heshamkisha H
The preceding snippet displays the followers in ascending order by calling the built-in
sorted function. The function's second argument is the function used to determine how
the elements of followers are sorted. In this case, we used a lambda that converts every
follower name to lowercase letters so we can perform a case-insensitive sort.
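
The effect of the key function is easiest to see side by side. This short sketch uses a few of the screen names from the output above. Without a key, sorted compares strings by character codes, so names beginning with uppercase letters come first.

```python
names = ['heshamkisha', 'BHood1976', 'abhinavborra', 'Eshwar12341']

# Default ordering: uppercase letters sort before lowercase letters
print(sorted(names))

# Case-insensitive ordering: compare the lowercase form of each name
print(sorted(names, key=lambda s: s.lower()))
```

The key function affects only the comparison. The strings in the result keep their original capitalization.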


Automatic Paging 


If the number of results requested is more than can be returned by one call to
followers, the items method automatically "pages" through the results by making
multiple calls to api.followers. Recall that followers returns up to 20 followers at a
time by default, so the preceding code needs to call followers only once. To get up to
200 followers at a time, we can create the Cursor with the count keyword argument, as
in:

cursor = tweepy.Cursor(api.followers, screen_name='nasa', count=200)

If you do not specify an argument to the items method, the Cursor attempts to get all
of the account's followers. For large numbers of followers, this could take a significant
amount of time due to Twitter's rate limits. The Twitter API's followers/list method
can return a maximum of 200 followers at a time and Twitter allows a maximum of 15
calls every 15 minutes. Thus, you can only get 3000 followers every 15 minutes using
Twitter's free APIs. Recall that we configured the API object to automatically wait when it
hits a rate limit, so if you try to get all followers and an account has more than 3000,
Tweepy will automatically pause for 15 minutes after every 3000 followers and display a
message. At the time of this writing, NASA has over 29.5 million followers. At 12,000
followers per hour, it would take over 100 days to get all of NASA's followers.
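
The arithmetic behind that estimate can be checked in a few lines. The limits below are the ones quoted in the preceding paragraph, and 29.5 million is the approximate follower count mentioned there.

```python
FOLLOWERS_PER_CALL = 200   # maximum returned by one followers/list call
CALLS_PER_WINDOW = 15      # calls allowed in each rate-limit window
WINDOW_MINUTES = 15        # length of a rate-limit window

per_window = FOLLOWERS_PER_CALL * CALLS_PER_WINDOW   # followers per window
per_hour = per_window * (60 // WINDOW_MINUTES)       # four windows per hour
days = 29_500_000 / per_hour / 24                    # days for 29.5 million

print(per_window, per_hour, round(days, 1))
```

This works out to 3000 followers per window, 12,000 per hour and a little over 102 days.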


Note that for this example, we could have called the followers method directly, rather 
than using a Cursor, since we're getting only a small number of followers. We used a 


Cursor here to show how you'll typically call followers. In some later examples, we'll 





call API methods directly to get just a few results, rather than using Cursors. 


Getting Follower IDs Rather Than Followers 


Though you can get complete User objects for a maximum of 200 followers at a time, you
can get many more Twitter ID numbers by calling the API object's followers_ids
method. This calls the Twitter API's followers/ids method, which returns up to 5000
ID numbers at a time (again, these rate limits could change). [12] You can invoke this
method up to 15 times every 15 minutes, so you can get 75,000 account ID numbers per
rate-limit interval. This is particularly useful when combined with the API object's
lookup_users method. This calls the Twitter API's users/lookup method [13], which
can return up to 100 User objects at a time and can be called up to 300 times every 15
minutes. So using this combination, you could get up to 30,000 User objects per
rate-limit interval.

[12] https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-followers-ids.

[13] https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-lookup.
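
Combining the two methods means splitting a long list of ID numbers into groups of at most 100, one group per lookup call. The following sketch shows only that splitting step. The helper name batches is hypothetical, the IDs are placeholders, and no Twitter call is made.

```python
def batches(ids, size=100):
    """Yield consecutive slices of ids, each with at most size elements."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


follower_ids = list(range(250))  # placeholder ID numbers, not real accounts
sizes = [len(batch) for batch in batches(follower_ids)]
print(sizes)  # two full batches, then the remainder

# The per-interval totals quoted above
print(5000 * 15)   # ID numbers per rate-limit interval
print(100 * 300)   # User objects per rate-limit interval
```

Each batch could then be passed to one lookup call, subject to the rate limits described above.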


12.9.2 Determining Whom an Account Follows 


The API object's friends method calls the Twitter API's friends/list method [14] to
get a list of User objects representing an account's friends. Twitter returns these in
groups of 20 by default, but you can request up to 200 at a time, just as we discussed for
method followers. Twitter allows you to call the friends/list method up to 15 times
every 15 minutes. Let's get 10 of NASA's friend accounts:

[14] https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-friends-list.

In [18]: friends = []

In [19]: cursor = tweepy.Cursor(api.friends, screen_name='nasa')

In [20]: for friend in cursor.items(10):
    ...:     friends.append(friend.screen_name)
    ...:

In [21]: print('Friends:',
    ...:       ' '.join(sorted(friends, key=lambda s: s.lower())))
    ...:
Friends: AFSpace Astro2fish Astro_Kimiya AstroAnnimal AstroDuke NASA3DPrin








12.9.3 Getting a User's Recent Tweets

The API method user_timeline returns tweets from the timeline of a specific account.
An account's user timeline contains the tweets that account posted, including its
retweets. The method calls the Twitter API's statuses/user_timeline method [15], which
returns the most recent 20 tweets, but can return up to 200 at a time. This method can
return only an account's 3200 most recent tweets. Applications using this method may
call it up to 1500 times every 15 minutes.

[15] https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.


Method user_timeline returns Status objects with each one representing a tweet.
Each Status's user property refers to a tweepy.models.User object containing
information about the user who sent that tweet, such as that user's screen_name. A
Status's text property contains the tweet's text. Let's display the screen_name and
text for three tweets from @nasa:

In [22]: nasa_tweets = api.user_timeline(screen_name='nasa', count=3)

In [23]: for tweet in nasa_tweets:
    ...:     print(f'{tweet.user.screen_name}: {tweet.text}\n')
    ...:
NASA: Your Gut in Space: Microorganisms in the intestinal tract play an est
https://t.co/uLOsUhwnSp

NASA: We need your help! Want to see panels at @SXSW related to space explc
https://t.co/ycqMMdGKUB

NASA: "You are as good as anyone in this town, but you are no better than
https://t.co/nhMD4n84Nf











These tweets were truncated (as indicated by ...), meaning that they probably use the newer
280-character tweet limit. We'll use the extended_tweet property shortly to access full
text for such tweets.

In the preceding snippets, we chose to call the user_timeline method directly and use
the count keyword argument to specify the number of tweets to retrieve. If you wish to
get more than the maximum number of tweets per call (200), then you should use a
Cursor to call user_timeline as demonstrated previously. Recall that a Cursor
automatically pages through the results by calling the method multiple times, if necessary.


Grabbing Recent Tweets from Your Own Timeline 





You can call the API method home_timeline, as in:

api.home_timeline()

to get tweets from your home timeline [16], that is, your tweets and tweets from the people
you follow. This method calls Twitter's statuses/home_timeline method. [17] By
default, home_timeline returns the most recent 20 tweets, but can get up to 200 at a
time. Again, for more than 200 tweets from your home timeline, you should use a Tweepy
Cursor to call home_timeline.

[16] Specifically for the account you used to authenticate with Twitter.

[17] https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-home_timeline.


12.10 SEARCHING RECENT TWEETS 


The Tweepy API method search returns tweets that match a query string. According to
the method's documentation, Twitter maintains its search index only for the previous
seven days' tweets, and a search is not guaranteed to return all matching tweets. Method
search calls Twitter's search/tweets method [18], which returns 15 tweets at a time by
default, but can return up to 100.

[18] https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.

Utility Function print_tweets from tweetutilities.py

For this section, we created a utility function print_tweets that receives the results of a
call to API method search and for each tweet displays the user's screen_name and the
tweet's text. If the tweet is not in English and the tweet.lang is not 'und'
(undefined), we'll also translate the tweet to English using TextBlob, as you did in the
"Natural Language Processing (NLP)" chapter. To use this function, import it from
tweetutilities.py:


In [24]: from tweetutilities import print_tweets

Just the print_tweets function's definition from that file is shown below:

# This function uses TextBlob, so the module containing it
# needs: from textblob import TextBlob
def print_tweets(tweets):
    """For each Tweepy Status object in tweets, display the
    user's screen_name and tweet text. If the language is not
    English, translate the text with TextBlob."""
    for tweet in tweets:
        # the screen name belongs to the Status object's user property
        print(f'{tweet.user.screen_name}:', end=' ')

        if 'en' in tweet.lang:
            print(f'{tweet.text}\n')
        elif 'und' not in tweet.lang:  # translate to English first
            print(f'\n  ORIGINAL: {tweet.text}')
            print(f'TRANSLATED: {TextBlob(tweet.text).translate()}\n')


Searching for Specific Words 


Let's search for three recent tweets about NASA's Mars Opportunity Rover. The search
method's q keyword argument specifies the query string, which indicates what to search
for and the count keyword argument specifies the number of tweets to return:

In [25]: tweets = api.search(q='Mars Opportunity Rover', count=3)

In [26]: print_tweets(tweets)
Jacker760: NASA set a deadline on the Mars Rover opportunity! As the dust
https://t.co/KO7xakgrzr

hivak32637174: RT @Gadgets360: NASA 'Cautiously Optimistic' of Hearing Bac
https://t.co/OLITEWRVEG

ladyanakina: NASA's Opportunity Rover Still Silent on Mars. https://t.co

As with other methods, if you plan to request more results than can be returned by one
call to search, you should use a Cursor object.


Searching with Twitter Search Operators 


You can use various Twitter search operators in your query strings to refine your search
results. The following table shows several Twitter search operators. Multiple operators
can be combined to construct more complex queries. To see all the operators, visit

https://twitter.com/search-home

and click the operators link.

Example                 Finds tweets containing

python twitter          Implicit logical and operator: finds tweets containing
                        python and twitter.

python OR twitter       Logical OR operator: finds tweets containing python or
                        twitter or both.

python ?                ? (question mark): finds tweets asking questions about
                        python.

planets -mars           - (minus sign): finds tweets containing planets but not
                        mars.

python :)               :) (happy face): finds positive sentiment tweets
                        containing python.

python :(               :( (sad face): finds negative sentiment tweets containing
                        python.

since:2018-09-01        Finds tweets on or after the specified date, which must be
                        in the form YYYY-MM-DD.

near:"New York City"    Finds tweets that were sent near "New York City".

from:nasa               Finds tweets from the account @nasa.

to:nasa                 Finds tweets to the account @nasa.


Let's use the from and since operators to get three tweets from NASA since September
1, 2018 (you should use a date within seven days before you execute this code):

In [27]: tweets = api.search(q='from:nasa since:2018-09-01', count=3)

In [28]: print_tweets(tweets)
NASA: @WYSIW Our missions detect active burning fires, track the transport
https://t.co/ix2iUoM1ry

NASA: Scarring of the landscape is evident in the wake of the Mendocino Co
https://t.co/Nboo5GD90m

NASA: RT @NASAglenn: To celebrate the #NASA60th anniversary, we're explori
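
Because the search index covers only about the previous seven days, it can be convenient to compute the since date rather than hard-code it. The following is a minimal sketch of building such a query string. The helper name from_since_query and its parameters are hypothetical, and the resulting string would still need to be passed as the q argument of the search method.

```python
from datetime import date, timedelta


def from_since_query(account, days_back=3, today=None):
    """Build a query such as 'from:nasa since:2018-09-01'.

    days_back is limited to 0-7 because the search index covers
    only about the previous seven days.
    """
    if not 0 <= days_back <= 7:
        raise ValueError('days_back must be between 0 and 7')
    today = today or date.today()
    since = today - timedelta(days=days_back)
    return f'from:{account} since:{since.isoformat()}'  # YYYY-MM-DD form


# A fixed "today" keeps this example reproducible
print(from_since_query('nasa', days_back=4, today=date(2018, 9, 5)))
```

Omitting the today argument uses the current date, which is what you would normally want.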


Searching for a Hashtag

Tweets often contain hashtags that begin with # to indicate something of importance,
like a trending topic. Let's get two tweets containing the hashtag #collegefootball:

In [29]: tweets = api.search(q='#collegefootball', count=2)

In [30]: print_tweets(tweets)
dmcreek: So much for #FAU giving #OU a game. #Oklahoma #FloridaAtlantic #C

heangrychef: It's game day folks! And our BBQ game is strong. #bbq #atlan





12.11 SPOTTING TRENDS: TWITTER TRENDS API


If a topic “goes viral,” you could have thousands or even millions of people tweeting about 
it at once. Twitter refers to these as trending topics and maintains lists of the trending 
topics worldwide. Via the Twitter Trends API, you can get lists of locations with trending 


topics and lists of the top 50 trending topics for each location. 


12.11.1 Places with Trending Topics 





The Tweepy API's trends_available method calls the Twitter API's
trends/available [19] method to get a list of all locations for which Twitter has trending
topics. Method trends_available returns a list of dictionaries representing these
locations. When we executed this code, there were 467 locations with trending topics:

[19] https://developer.twitter.com/en/docs/trends/locations-with-trending-topics/api-reference/get-trends-available.

In [31]: trends_available = api.trends_available()

In [32]: len(trends_available)
Out[32]: 467

The dictionary in each list element returned by trends_available has various
information, including the location's name and woeid (discussed below):

In [33]: trends_available[0]
Out[33]:
{'name': 'Worldwide',
 'placeType': {'code': 19, 'name': 'Supername'},
 'url': 'http://where.yahooapis.com/v1/place/1',
 'parentid': 0,
 'country': '',
 'woeid': 1,
 'countryCode': None}

In [34]: trends_available[1]
Out[34]:
{'name': 'Winnipeg',
 'placeType': {'code': 7, 'name': 'Town'},
 'url': 'http://where.yahooapis.com/v1/place/2972',
 'parentid': 23424775,
 'country': 'Canada',
 'woeid': 2972,
 'countryCode': 'CA'}

The Twitter Trends API’s trends/place method (discussed momentarily) uses Yahoo! 
Where on Earth IDs (WOEIDs) to look up trending topics. The WOEID 1 represents 
worldwide. Other locations have unique WOEID values greater than 1. We’ll use WOEID 
values in the next two subsections to get worldwide trending topics and trending topics for 
a specific city. The following table shows WOEID values for several landmarks, cities, 
states and continents. Note that although these are all valid WOEIDs, Twitter does not 


necessarily have trending topics for all these locations. 


Place                WOEID       Place            WOEID

Statue of Liberty    23617050    Iguazu Falls     468785
Los Angeles, CA      2442047     United States    23424977
Washington, D.C.     2514815     North America    24865672
Paris, France        615702      Europe           24865675


You also can search for locations close to a location that you specify with latitude and
longitude values. To do so, call the Tweepy API’s trends_closest method, which
invokes the Twitter API’s trends/closest method.*

* https://developer.twitter.com/en/docs/trends/locations-with-trending-topics/api-reference/get-trends-closest.
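Because each location dictionary carries both a name and a woeid, a name-to-WOEID lookup can be built with a dictionary comprehension. The following is a minimal sketch rather than part of the original session: sample_locations is a hand-copied stand-in for the two elements displayed earlier (a real trends_available list has hundreds of entries), and woeid_for is a hypothetical helper name.

```python
# Stand-in for the list returned by api.trends_available(); only the two
# entries displayed earlier in this section are reproduced here.
sample_locations = [
    {'name': 'Worldwide', 'woeid': 1, 'countryCode': None},
    {'name': 'Winnipeg', 'woeid': 2972, 'countryCode': 'CA'},
]

def woeid_for(name, locations):
    """Return the WOEID for the location called name, or None if absent."""
    # Case-insensitive lookup table: lowercase name -> woeid
    lookup = {location['name'].lower(): location['woeid']
              for location in locations}
    return lookup.get(name.lower())

print(woeid_for('Winnipeg', sample_locations))
print(woeid_for('Atlantis', sample_locations))
```

Returning None for an unknown name mirrors the point made above: a place can have a valid WOEID without appearing in Twitter's list of locations that have trending topics.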


12.11.2 Getting a List of Trending Topics 


The Tweepy API’s trends_place method calls the Twitter Trends API’s
trends/place method* to get the top 50 trending topics for the location with the
specified WOEID. You can get the WOEIDs from the woeid key in each dictionary
returned by the trends_available or trends_closest methods discussed in the
previous section, or you can find a location’s Yahoo! Where on Earth ID (WOEID) by
searching for a city/town, state, country, address, zip code or landmark at

http://www.woeidlookup.com

* https://developer.twitter.com/en/docs/trends/trends-for-location/api-reference/get-trends-place.

You also can look up WOEIDs programmatically using Yahoo!’s web services via Python
libraries like woeid:**

https://github.com/Ray-SunR/woeid

** You’ll need a Yahoo! API key as described in the woeid module’s documentation.


Worldwide Trending Topics 


Let’s get today’s worldwide trending topics (your results will differ): 


In [35]: world_trends = api.trends_place(id=1)

Method trends_place returns a one-element list containing a dictionary. The
dictionary’s 'trends' key refers to a list of dictionaries representing each trend:

In [36]: trends_list = world_trends[0]['trends']

Each trend dictionary has name, url, promoted_content (indicating the tweet is an
advertisement), query and tweet_volume keys (shown below). The following trend is in
Spanish; #BienvenidoSeptiembre means “Welcome September”:


In [37]: trends_list[0]
Out[37]:
{'name': '#BienvenidoSeptiembre',
 'url': 'http://twitter.com/search?q=%23BienvenidoSeptiembre',
 'promoted_content': None,
 'query': '%23BienvenidoSeptiembre',
 'tweet_volume': 15186}

For trends with more than 10,000 tweets, the tweet_volume is the number of tweets;
otherwise, it’s None. Let’s use a list comprehension to filter the list so that it contains only
trends with more than 10,000 tweets:

In [38]: trends_list = [t for t in trends_list if t['tweet_volume']]

Next, let’s sort the trends in descending order by tweet_volume:

In [39]: from operator import itemgetter

In [40]: trends_list.sort(key=itemgetter('tweet_volume'), reverse=True)


Now, let’s display the names of the top five trending topics: 


In [41]: for trend in trends_list[:5]:
    ...:     print(trend['name'])
    ...:
#HBDJanaSenaniPawanKalyan
#BackToHogwarts
Khalil Mack
#ItalianGP
Alisson





New York City Trending Topics 


Now, let’s get the top five trending topics for New York City (WOEID 2459115). The 
following code performs the same tasks as above, but for the different WOEID: 


In [42]: nyc_trends = api.trends_place(id=2459115)  # New York City WOEID

In [43]: nyc_list = nyc_trends[0]['trends']

In [44]: nyc_list = [t for t in nyc_list if t['tweet_volume']]

In [45]: nyc_list.sort(key=itemgetter('tweet_volume'), reverse=True)

In [46]: for trend in nyc_list[:5]:
    ...:     print(trend['name'])
    ...:
#IDOL100M
#TuesdayThoughts
#HappyBirthdayLiam
NAFTA
#USOpen
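The worldwide and New York City sessions repeat the same three steps: filter out trends with no tweet volume, sort by volume in descending order, then take the first five names. One way to avoid the repetition is to wrap those steps in a function. This is a sketch under stated assumptions: top_trends is a hypothetical helper name, and sample_trends is invented data shaped like the dictionaries in a 'trends' list, not real Twitter output.

```python
from operator import itemgetter

def top_trends(trends, count=5):
    """Return the names of the count highest-volume trends.

    Trends whose 'tweet_volume' is None (or 0) are dropped first,
    mirroring the list-comprehension filter used in the session.
    """
    with_volume = [t for t in trends if t['tweet_volume']]
    with_volume.sort(key=itemgetter('tweet_volume'), reverse=True)
    return [t['name'] for t in with_volume[:count]]

# Invented sample data shaped like the dictionaries in a 'trends' list
sample_trends = [
    {'name': '#TopicA', 'tweet_volume': 15186},
    {'name': '#TopicB', 'tweet_volume': None},
    {'name': '#TopicC', 'tweet_volume': 98000},
    {'name': '#TopicD', 'tweet_volume': 42000},
]

print(top_trends(sample_trends, count=2))
```

With a live API object, the same helper could presumably be applied to api.trends_place(id=...)[0]['trends'] for any WOEID. The function builds a new list, so the caller's list is left unsorted and unfiltered.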














12.11.3 Create a Word Cloud from Trending Topics 


In the “Natural Language Processing” chapter, we used the WordCloud library to create
word clouds. Let’s use it again here, to visualize New York City’s trending topics that have
more than 10,000 tweets each. First, let’s create a dictionary of key-value pairs consisting
of the trending topic names and tweet volumes:

In [47]: topics = {}

In [48]: for trend in nyc_list:
    ...:     topics[trend['name']] = trend['tweet_volume']
    ...:
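The loop that fills the topics dictionary can also be written as a dictionary comprehension. The sketch below shows the equivalent form; the three trend names come from the New York City output earlier in this section, but the tweet volumes are invented for illustration.

```python
# Invented volumes; the names are taken from the New York City output
nyc_list = [
    {'name': '#IDOL100M', 'tweet_volume': 500000},
    {'name': '#TuesdayThoughts', 'tweet_volume': 120000},
    {'name': 'NAFTA', 'tweet_volume': 60000},
]

# Equivalent to the loop: map each trend name to its tweet volume
topics = {trend['name']: trend['tweet_volume'] for trend in nyc_list}

print(topics)
```

Either form produces the name-to-frequency mapping that is later handed to the word cloud.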


Next, let’s create a WordCloud from the topics dictionary’s key-value pairs, then
output the word cloud to the image file TrendingTwitter.png (shown after the code).
The argument prefer_horizontal=0.5 suggests that 50% of the words should be
horizontal, though the software may ignore that to fit the content:

In [49]: from wordcloud import WordCloud

In [50]: wordcloud = WordCloud(width=1600, height=900,
    ...:     prefer_horizontal=0.5, min_font_size=10, colormap='prism',
    ...:     background_color='white')
    ...:

In [51]: wordcloud = wordcloud.fit_words(topics)

In [52]: wordcloud = wordcloud.to_file('TrendingTwitter.png')


The resulting word cloud is shown below—yours will differ based on the trending topics 


the day you run the code: 


[Figure: word cloud of New York City trending topics; legible entries include #WednesdayWisdom, #TuesdayThoughts, #IDOL100M, #ElectionDay, #HappyBirthdayLiam, #BachelorInParadise and Times Square.]


12.12 CLEANING/PREPROCESSING TWEETS FOR 
ANALYSIS 


Data cleaning is one of the most common tasks that data scientists perform. Depending on
how you intend to process tweets, you'll need to use natural language processing to
normalize them by performing some or all of the data cleaning tasks in the following table.
Many of these can be performed using the libraries introduced in the “Natural Language
Processing (NLP)” chapter:

Tweet cleaning tasks

Converting all text to the same case
Removing stop words
Removing # symbol from hashtags
Removing @-mentions
Removing duplicates
Removing excess whitespace
Removing hashtags
Removing punctuation
Removing RT (retweet) and FAV (favorite)
Removing URLs
Stemming
Lemmatization
Tokenization


tweet-preprocessor Library and TextBlob Utility Functions 


In this section, we'll use the tweet-preprocessor library

https://github.com/s/preprocessor

to perform some basic tweet cleaning. It can automatically remove any combination of:

• URLs,
• @-mentions (like @nasa),
• hashtags (like #mars),
• Twitter reserved words (like RT for retweet and FAV for favorite, which is similar to a
  “like” on other social networks),
• emojis (all or just smileys) and
• numbers.

The following table shows the module’s constants representing each option:

Option                         Option constant

@-Mentions (e.g., @nasa)       OPT.MENTION
Emoji                          OPT.EMOJI
Hashtag (e.g., #mars)          OPT.HASHTAG
Number                         OPT.NUMBER
Reserved Words (RT and FAV)    OPT.RESERVED
Smiley                         OPT.SMILEY
URL                            OPT.URL


Installing tweet-preprocessor 
To install tweet-preprocessor, open your Anaconda Prompt (Windows), Terminal 
(macOS/Linux) or shell (Linux), then issue the following command: 


pip install tweet-preprocessor 


Windows users might need to run the Anaconda Prompt as an administrator for proper 
software installation privileges. To do so, right-click Anaconda Prompt in the start menu 


and select More > Run as administrator. 


Cleaning a Tweet 


Let’s do some basic tweet cleaning that we'll use in a later example in this chapter. The 
tweet-preprocessor library’s module name is preprocessor. Its documentation 


recommends that you import the module as follows: 


In [1]: import preprocessor as p

To set the cleaning options you'd like to use, call the module’s set_options function. In
this case, we’d like to remove URLs and Twitter reserved words:

In [2]: p.set_options(p.OPT.URL, p.OPT.RESERVED)

Now let’s clean a sample tweet containing a reserved word (RT) and a URL:

In [3]: tweet_text = 'RT A sample retweet with a URL https://nasa.gov'

In [4]: p.clean(tweet_text)
Out[4]: 'A sample retweet with a URL'
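To make the effect of these two options concrete, here is a rough standard-library approximation of what removing URLs and reserved words involves. It is a sketch only and is not how tweet-preprocessor is implemented: clean_tweet and both regular expressions are assumptions chosen for illustration, and they handle far fewer cases than the library does.

```python
import re

# Simplified patterns (assumptions, not the library's own expressions)
URL_PATTERN = re.compile(r'https?://\S+')
RESERVED_PATTERN = re.compile(r'\b(?:RT|FAV)\b')

def clean_tweet(text):
    """Remove URLs and the reserved words RT/FAV, then tidy whitespace."""
    text = URL_PATTERN.sub('', text)
    text = RESERVED_PATTERN.sub('', text)
    return ' '.join(text.split())  # collapse runs of whitespace

print(clean_tweet('RT A sample retweet with a URL https://nasa.gov'))
```

The final whitespace step matters: deleting a token leaves its surrounding spaces behind, so the text is split and rejoined to remove leading, trailing and doubled spaces.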


12.13 TWITTER STREAMING API 


Twitter’s free Streaming API sends to your app randomly selected tweets dynamically as
they occur, up to a maximum of one percent of the tweets per day. According to
InternetLiveStats.com, there are approximately 6000 tweets per second, which is
over 500 million tweets per day.3 So the Streaming API gives you access to approximately
five million tweets per day. Twitter used to allow free access to 10% of streaming tweets,
but this service (called the fire hose) is now available only as a paid service. In this
section, we'll use a class definition and an IPython session to walk through the steps for
processing streaming tweets. Note that the code for receiving a tweet stream requires
creating a custom class that inherits from another class. These topics are covered in
Chapter 10.

3 http://www.internetlivestats.com/twitter-statistics/.
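The daily figures follow directly from the per-second rate. The short calculation below reproduces the arithmetic, taking the approximate 6000 tweets per second quoted above as its only input; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
tweets_per_second = 6000         # approximate figure quoted above

tweets_per_day = tweets_per_second * SECONDS_PER_DAY
streaming_share = tweets_per_day // 100   # the free tier's one-percent cap

print(f'{tweets_per_day:,} tweets per day')
print(f'{streaming_share:,} tweets per day via the Streaming API')
```

The results, 518,400,000 and 5,184,000, match the rounded "over 500 million" and "approximately five million" in the text.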


12.13.1 Creating a Subclass of StreamListener 


The Streaming API returns tweets as they happen that match your search criteria. Rather 
than connecting to Twitter on each method call, a stream uses a persistent connection to 
push (that is, send) tweets to your app. The rate at which those tweets arrive varies 
tremendously, based on your search criteria. The more popular a topic is, the more likely 


it is that the tweets will arrive quickly. 


You create a subclass of Tweepy’s StreamListener class to process the tweet stream.
An object of this class is the listener that’s notified when each new tweet (or other
message sent by Twitter4) arrives. Each message Twitter sends results in a call to a
StreamListener method. The following table summarizes several such methods.
StreamListener already defines each method, so you redefine only the methods you
need; this is known as overriding. For additional StreamListener methods, see:

https://github.com/tweepy/tweepy/blob/master/tweepy/streaming.py

4 For details on the messages, see
https://developer.twitter.com/en/docs/tweets/filter-realtime/guides/streaming-message-types.html.

Method                        Description

on_connect(self)              Called when you successfully connect to the Twitter
                              stream. This is for statements that should execute only if
                              your app is connected to the stream.

on_status(self, status)       Called when a tweet arrives; status is an object of
                              Tweepy’s Status.

on_limit(self, track)         Called when a limit notice arrives. This occurs if your
                              search matches more tweets than Twitter can deliver
                              based on its current streaming rate limits. In this case,
                              the limit notice contains the number of matching tweets
                              that could not be delivered.

on_error(self, status_code)   Called in response to error codes sent by Twitter.

on_timeout(self)              Called if the connection times out, that is, the Twitter
                              server is not responding.

on_warning(self, notice)      Called if Twitter sends a disconnect warning to indicate
                              that the connection might be closed. For example,
                              Twitter maintains a queue of the tweets it’s pushing to
                              your app. If the app does not read the tweets fast enough,
                              on_warning’s notice argument will contain a warning
                              message indicating that the connection will terminate if
                              the queue becomes full.


Class TweetListener 


Our StreamListener subclass TweetListener is defined in tweetlistener.py. We
discuss the TweetListener’s components here. Line 6 indicates that class
TweetListener is a subclass of tweepy.StreamListener. This ensures that our new
class has class StreamListener’s default method implementations.

1  # tweetlistener.py
2  """tweepy.StreamListener subclass that processes tweets as they arrive."""
3  import tweepy
4  from textblob import TextBlob
5
6  class TweetListener(tweepy.StreamListener):
7      """Handles incoming Tweet stream."""
8











Class TweetListener: __init__ Method

The following lines define the TweetListener class’s __init__ method, which is called
when you create a new TweetListener object. The api parameter is the Tweepy API
object that TweetListener will use to interact with Twitter. The limit parameter is
the total number of tweets to process (10 by default). We added this parameter to enable
you to control the number of tweets to receive. As you'll soon see, we terminate the stream
when that limit is reached. If you set limit to None, the stream will not terminate
automatically. Line 11 creates an instance variable to keep track of the number of tweets
processed so far, and line 12 creates a constant to store the limit. If you’re not familiar
with __init__ and super() from previous chapters, line 13 ensures that the api object
is stored properly for use by your listener object.


 9      def __init__(self, api, limit=10):
10          """Create instance variables for tracking number of tweets."""
11          self.tweet_count = 0
12          self.TWEET_LIMIT = limit
13          super().__init__(api)  # call superclass's init
14

Class TweetListener: on_connect Method


Method on_connect is called when your app successfully connects to the Twitter stream. 


We override the default implementation to display a “Connection successful” message. 


15      def on_connect(self):
16          """Called when your connection attempt is successful, enabling
17          you to perform appropriate application tasks at that point."""
18          print('Connection successful\n')
19








Class TweetListener: on_status Method

Method on_status is called by Tweepy when each tweet arrives. This method’s second
parameter receives a Tweepy Status object representing the tweet. Lines 23–26 get the
tweet’s text. First, we assume the tweet uses the new 280-character limit, so we attempt to
access the tweet’s extended_tweet property and get its full_text. An exception will
occur if the tweet does not have an extended_tweet property. In this case, we get the
text property instead. Lines 28–30 then display the screen_name of the user who sent
the tweet, the lang (that is, language) of the tweet and the tweet_text. If the language is
not English ('en'), lines 32–33 use a TextBlob to translate the tweet and display it in
English. We increment self.tweet_count (line 36), then compare it to
self.TWEET_LIMIT in the return statement. If on_status returns True, the stream
remains open. When on_status returns False, Tweepy disconnects from the stream.


20      def on_status(self, status):
21          """Called when Twitter pushes a new tweet to you."""
22          # get the tweet text
23          try:
24              tweet_text = status.extended_tweet.full_text
25          except:
26              tweet_text = status.text
27
28          print(f'Screen name: {status.user.screen_name}:')
29          print(f'   Language: {status.lang}')
30          print(f'     Status: {tweet_text}')
31
32          if status.lang != 'en':
33              print(f' Translated: {TextBlob(tweet_text).translate()}')
34
35          print()
36          self.tweet_count += 1  # track number of tweets processed
37
38          # if TWEET_LIMIT is reached, return False to terminate streaming
39          return self.tweet_count != self.TWEET_LIMIT

12.13.2 Initiating Stream Processing


Let’s use an IPython session to test our new TweetListener. 


Authenticating 


First, you must authenticate with Twitter and create a Tweepy API object: 


In [1]: import tweepy

In [2]: import keys

In [3]: auth = tweepy.OAuthHandler(keys.consumer_key,
   ...:                            keys.consumer_secret)
   ...:

In [4]: auth.set_access_token(keys.access_token,
   ...:                       keys.access_token_secret)
   ...:

In [5]: api = tweepy.API(auth, wait_on_rate_limit=True,
   ...:                  wait_on_rate_limit_notify=True)
   ...:





Creating a TweetListener 


Next, create an object of the TweetListener class and initialize it with the api object:

In [6]: from tweetlistener import TweetListener

In [7]: tweet_listener = TweetListener(api)

We did not specify the limit argument, so this TweetListener terminates after 10
tweets.
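The way the limit ends a stream can be tried without a Twitter connection. The sketch below assumes nothing about Tweepy's internals: CountingListener and fake_stream are hypothetical stand-ins that copy only the counting logic and the convention that a False return value from on_status stops delivery.

```python
from types import SimpleNamespace

class CountingListener:
    """Stand-in for a StreamListener subclass; no Tweepy required."""

    def __init__(self, limit=10):
        self.tweet_count = 0
        self.TWEET_LIMIT = limit
        self.seen = []

    def on_status(self, status):
        self.seen.append(status.text)
        self.tweet_count += 1  # track number of tweets processed
        # False once the limit is reached; always True when limit is None
        return self.tweet_count != self.TWEET_LIMIT

def fake_stream(listener, statuses):
    """Deliver statuses until on_status returns False."""
    for status in statuses:
        if listener.on_status(status) is False:
            break

# 25 pretend tweets; each object only needs a text attribute here
statuses = [SimpleNamespace(text=f'tweet {i}') for i in range(25)]

listener = CountingListener()            # default limit of 10
fake_stream(listener, statuses)

unlimited = CountingListener(limit=None)  # never returns False
fake_stream(unlimited, statuses)

print(listener.tweet_count, unlimited.tweet_count)
```

Because the count is compared with != rather than <, a limit of None can never be matched, which is why passing None keeps the stream open until something else stops it.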


Creating a Stream 


A Tweepy Stream object manages the connection to the Twitter stream and passes the
messages to your TweetListener. The Stream constructor’s auth keyword argument
receives the api object’s auth property, which contains the previously configured
OAuthHandler object. The listener keyword argument receives your listener object:

In [8]: tweet_stream = tweepy.Stream(auth=api.auth,
   ...:                              listener=tweet_listener)
   ...:

Starting the Tweet Stream

The Stream object’s filter method begins the streaming process. Let’s track tweets
about the NASA Mars rovers. Here, we use the track parameter to pass a list of search
terms:

In [9]: tweet_stream.filter(track=['Mars Rover'], is_async=True)


The Streaming API will return full tweet JSON objects for tweets that match any of the
terms, not just in the tweet’s text, but also in @-mentions, hashtags, expanded URLs and
other information that Twitter maintains in a tweet object’s JSON. So, you might not see
the search terms you’re tracking if you look only at the tweet’s text.


Asynchronous vs. Synchronous Streams 


The is_async=True argument indicates that filter should initiate an asynchronous
tweet stream. This allows your code to continue executing while your listener waits to
receive tweets and is useful if you decide to terminate the stream early. When you execute
an asynchronous tweet stream in IPython, you'll see the next In [] prompt and can
terminate the tweet stream by setting the Stream object’s running property to False,
as in:

tweet_stream.running=False

Without the is_async=True argument, filter initiates a synchronous tweet
stream. In this case, IPython would display the next In [] prompt after the stream
terminates. Asynchronous streams are particularly handy in GUI applications so your
users can continue to interact with other parts of the application while tweets arrive. The
following shows a portion of the output consisting of two tweets:

Connection successful

Screen name: bevjoy:
   Language: en
     Status: RT @SPACEdotcom: With Mars Dust Storm Clearing, Opportunity R

Screen name: tourmalinel973:
   Language: en
     Status: RT @BennuBirdy: Our beloved Mars rover isn't done yet, but sh
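The difference between the two modes can be pictured with a small threading example. This is a toy model only: FakeStream is a hypothetical class that borrows the filter, is_async and running names from the discussion above, and it makes no claim about how Tweepy implements its streams.

```python
import threading
import time

class FakeStream:
    """Toy stand-in for a stream object; it is not Tweepy's implementation."""

    def __init__(self, limit=None):
        self.running = False
        self.delivered = 0
        self.limit = limit        # stop after this many items (None = no cap)
        self._thread = None

    def _run(self):
        # Deliver pretend tweets until told to stop or the cap is reached
        while self.running:
            self.delivered += 1
            if self.limit is not None and self.delivered >= self.limit:
                self.running = False
            time.sleep(0.01)

    def filter(self, is_async=False):
        self.running = True
        if is_async:
            # Background thread: filter returns immediately
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()
        else:
            # Blocks the caller until the loop ends
            self._run()

    def join(self):
        if self._thread is not None:
            self._thread.join()

# Asynchronous: control returns at once, and we stop the stream ourselves
async_stream = FakeStream()
async_stream.filter(is_async=True)
time.sleep(0.05)
async_stream.running = False
async_stream.join()

# Synchronous: filter returns only after the cap ends the loop
sync_stream = FakeStream(limit=3)
sync_stream.filter()
print(sync_stream.delivered)
```

In the asynchronous case the caller keeps control, so it is the caller that sets running to False. In the synchronous case nothing after filter executes until the loop has ended on its own.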








Other filter Method Parameters 


Method filter also has parameters for refining your tweet searches by Twitter user ID
numbers (to follow tweets from specific users) and by location. For details, see:

https://developer.twitter.com/en/docs/tweets/filter-realtime/guides/basic-stream-parameters

Twitter Restrictions Note

Marketers, researchers and others frequently store tweets they receive from the Streaming
API. If you’re storing tweets, Twitter requires you to delete any message or location data
for which you receive a deletion message. This will occur if a user deletes a tweet or the
tweet’s location data after Twitter pushes that tweet to you. In each case, your listener’s
on_delete method will be called. For deletion rules and message details, see

https://developer.twitter.com/en/docs/tweets/filter-realtime/guides/streamin





12.14 TWEET SENTIMENT ANALYSIS

In the “Natural Language Processing (NLP)” chapter, we demonstrated sentiment analysis
on sentences. Many researchers and companies perform sentiment analysis on tweets. For
example, political researchers might check tweet sentiment during election season to
understand how people feel about specific politicians and issues. Companies might check
tweet sentiment to see what people are saying about their products and competitors’
products.

In this section, we'll use the techniques introduced in the preceding section to create a
script (sentimentlistener.py) that enables you to check the sentiment on a specific
topic. The script will keep totals of all the positive, neutral and negative tweets it processes
and display the results.

The script receives two command-line arguments representing the topic of the tweets you
wish to receive and the number of tweets for which to check the sentiment; only those
tweets that are not eliminated are counted. For viral topics, there are large numbers of
retweets, which we are not counting, so it could take some time to get the number of tweets
you specify. You can run the script from the ch12 folder as follows:


ipython sentimentlistener.py football 10

which produces output like the following. Positive tweets are preceded by a +, negative
tweets by a - and neutral tweets by a space:

- ftblNeutral: Awful game of football. So boring slow hoofball complete was

+ TBulmer28: I've seen 2 successful onside kicks within a 40 minute span. ]

+ CMayADayl2: The last normal Sunday for the next couple months. Don't text

  rpimusic: My heart legitimately hurts for Kansas football fans

+ DSCunningham30: @LeahShieldsWPSD It's awsome that u like college football]

  damanr: I'm bummed I don't know enough about football to roast @samesfanc

+ jJamesianosborne: @TheRochaSays @WatfordFC @JackHind Haha.... just when yc

+ Tshanerbeer: @PennStateFball @PennStateOnBTN Ah yes, welcome back college

- cougarhokie: @hokiehack @skiptyler I can verify the badness of that footk

+ Unite Reddevils: @Pablo di Don Well make yourself clear HES FOotbalis AOL

Tweet sentiment for "football"
Positive: 6
 Neutral: 2
Negative: 2

The script (sentimentlistener.py) is presented below. We focus only on the new
capabilities in this example.

Imports

Lines 4–8 import the keys.py file and the libraries used throughout the script:

1  # sentimentlistener.py
2  """Script that searches for tweets that match a search string
3  and tallies the number of positive, neutral and negative tweets."""
4  import keys
5  import preprocessor as p
6  import sys
7  from textblob import TextBlob
8  import tweepy
9


Class SentimentListener: __init__ Method

In addition to the API object that interacts with Twitter, the __init__ method receives
three additional parameters:

• sentiment_dict: a dictionary in which we'll keep track of the tweet sentiments,
• topic: the topic we’re searching for so we can ensure that it appears in the tweet text
  and
• limit: the number of tweets to process (not including the ones we eliminate).

Each of these is stored in the current SentimentListener object (self).

10  class SentimentListener(tweepy.StreamListener):
11      """Handles incoming Tweet stream."""
12
13      def __init__(self, api, sentiment_dict, topic, limit=10):
14          """Configure the SentimentListener."""
15          self.sentiment_dict = sentiment_dict
16          self.tweet_count = 0
17          self.topic = topic
18          self.TWEET_LIMIT = limit
19
20          # set tweet-preprocessor to remove URLs/reserved words
21          p.set_options(p.OPT.URL, p.OPT.RESERVED)
22          super().__init__(api)  # call superclass's init
23


Method on_status

When a tweet is received, method on_status:

• gets the tweet’s text (lines 27–30),
• skips the tweet if it’s a retweet (lines 33–34),
• cleans the tweet to remove URLs and reserved words like RT and FAV (line 36),
• skips the tweet if it does not have the topic in the tweet text (lines 39–40),
• uses a TextBlob to check the tweet’s sentiment and updates the sentiment_dict
  accordingly (lines 43–52),
• prints the tweet text (line 55) preceded by + for positive sentiment, space for neutral
  sentiment or - for negative sentiment and
• checks whether we’ve processed the specified number of tweets yet (lines 57–60).


24      def on_status(self, status):
25          """Called when Twitter pushes a new tweet to you."""
26          # get the tweet's text
27          try:
28              tweet_text = status.extended_tweet.full_text
29          except:
30              tweet_text = status.text
31
32          # ignore retweets
33          if tweet_text.startswith('RT'):
34              return
35
36          tweet_text = p.clean(tweet_text)  # clean the tweet
37
38          # ignore tweet if the topic is not in the tweet text
39          if self.topic.lower() not in tweet_text.lower():
40              return
41
42          # update self.sentiment_dict with the polarity
43          blob = TextBlob(tweet_text)
44          if blob.sentiment.polarity > 0:
45              sentiment = '+'
46              self.sentiment_dict['positive'] += 1
47          elif blob.sentiment.polarity == 0:
48              sentiment = ' '
49              self.sentiment_dict['neutral'] += 1
50          else:
51              sentiment = '-'
52              self.sentiment_dict['negative'] += 1
53
54          # display the tweet
55          print(f'{sentiment} {status.user.screen_name}: {tweet_text}\n')
56
57          self.tweet_count += 1  # track number of tweets processed
58
59          # if TWEET_LIMIT is reached, return False to terminate streaming
60          return self.tweet_count != self.TWEET_LIMIT
61

Main Application

The main application is defined in the function main (lines 62–87; discussed after the
following code), which is called by lines 90–91 when you execute the file as a script. So
sentimentlistener.py can be imported into IPython or other modules to use class
SentimentListener as we did with TweetListener in the previous section:


62  def main():
63      # configure the OAuthHandler
64      auth = tweepy.OAuthHandler(keys.consumer_key, keys.consumer_secret)
65      auth.set_access_token(keys.access_token, keys.access_token_secret)
66
67      # get the API object
68      api = tweepy.API(auth, wait_on_rate_limit=True,
69                       wait_on_rate_limit_notify=True)
70
71      # create the StreamListener subclass object
72      search_key = sys.argv[1]
73      limit = int(sys.argv[2])  # number of tweets to tally
74      sentiment_dict = {'positive': 0, 'neutral': 0, 'negative': 0}
75      sentiment_listener = SentimentListener(api,
76          sentiment_dict, search_key, limit)
77
78      # set up Stream
79      stream = tweepy.Stream(auth=api.auth, listener=sentiment_listener)
80
81      # start filtering English tweets containing search_key
82      stream.filter(track=[search_key], languages=['en'], is_async=False)
83
84      print(f'Tweet sentiment for "{search_key}"')
85      print('Positive:', sentiment_dict['positive'])
86      print(' Neutral:', sentiment_dict['neutral'])
87      print('Negative:', sentiment_dict['negative'])
88
89  # call main if this file is executed as a script
90  if __name__ == '__main__':
91      main()

Lines 72–73 get the command-line arguments. Line 74 creates the sentiment_dict
dictionary that keeps track of the tweet sentiments. Lines 75–76 create the
SentimentListener. Line 79 creates the Stream object. We once again initiate the
stream by calling Stream method filter (line 82). However, this example uses a
synchronous stream so that lines 84–87 display the sentiment report only after the
specified number of tweets (limit) are processed. In this call to filter, we also
provided the keyword argument languages, which specifies a list of language codes. The
one language code 'en' indicates Twitter should return only English language tweets.
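The tallying step inside on_status can be exercised on its own, without TextBlob or a live stream. The sketch below isolates that logic in a hypothetical helper named tally; the polarity scores are invented stand-ins for blob.sentiment.polarity values, which range from -1.0 to 1.0.

```python
def tally(polarity, sentiment_dict):
    """Update sentiment_dict for one tweet and return its display prefix."""
    if polarity > 0:
        sentiment_dict['positive'] += 1
        return '+'
    if polarity == 0:
        sentiment_dict['neutral'] += 1
        return ' '
    sentiment_dict['negative'] += 1
    return '-'

sentiment_dict = {'positive': 0, 'neutral': 0, 'negative': 0}

# Invented polarity scores standing in for blob.sentiment.polarity
prefixes = [tally(score, sentiment_dict) for score in (0.5, 0.0, -0.75, 0.1)]

print(prefixes)
print(sentiment_dict)
```

Only a polarity of exactly 0 counts as neutral, so even a faintly positive or negative score moves a tweet out of the neutral total.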


12.15 GEOCODING AND MAPPING 


In this section, we'll collect streaming tweets, then plot the locations of those tweets. Most 
tweets do not include latitude and longitude coordinates, because Twitter disables this by 
default for all users. Those who wish to include their precise location in tweets must opt 
into that feature. Though most tweets do not include precise location information, a large 
percentage include the user’s home location information; however, even that is sometimes 


invalid, such as “Far Away” or a fictitious location from a user’s favorite movie. 


In this section, for simplicity, we'll use the location property of the tweet’s User object
to plot that user’s location on an interactive map. The map will let you zoom in and out
and drag to move the map around so you can look at different areas (known as panning).
For each tweet, we'll display a map marker that you can click to see a popup containing
the user’s screen name and tweet text.


We'll ignore retweets and tweets that do not contain the search topic. For other tweets, 
we'll track the percentage of tweets with location information. When we get the latitude 
and longitude information for those locations, we'll also track the percentage of those 


tweets that had invalid location data. 


geopy Library 


We'll use the geopy library (https://github.com/geopy/geopy) to translate
locations into latitude and longitude coordinates (known as geocoding) so we can place
markers on a map. The library supports dozens of geocoding web services, many of which
have free or lite tiers. For this example, we'll use the OpenMapQuest geocoding
service (discussed shortly). You installed geopy in Section 12.6.

OpenMapQuest Geocoding API

We'll use the OpenMapQuest Geocoding API to convert locations, such as Boston, MA,
into their latitudes and longitudes, such as 42.3602534 and -71.0582912, for plotting on
maps. OpenMapQuest currently allows 15,000 transactions per month on their free tier.
To use the service, first sign up at

https://developer.mapquest.com/

Once logged in, go to

https://developer.mapquest.com/user/me/apps

and click Create a New Key, fill in the App Name field with a name of your choosing,
leave the Callback URL empty and click Create App to create an API key. Next, click
your app’s name in the web page to see your consumer key. In the keys.py file you used
earlier in the chapter, store the consumer key by replacing YourKeyHere in the line

mapquest_key = 'YourKeyHere'

As we did earlier in the chapter, we'll import keys.py to access this key.


Folium Library and Leaflet.js JavaScript Mapping Library 


For the maps in this example, we'll use the folium library

https://github.com/python-visualization/folium

which uses the popular Leaflet.js JavaScript mapping library to display maps. The maps
that folium produces are saved as HTML files that you can view in your web browser. To
install folium, execute the following command:

pip install folium

Maps from OpenStreetMap.org

By default, Leaflet.js uses open source maps from OpenStreetMap.org. These maps are
copyrighted by the OpenStreetMap.org contributors. To use these maps,9 they require the
following copyright notice:

Map data © OpenStreetMap contributors

and they state:

You must make it clear that the data is available under the Open Database License. This
can be achieved by providing a “License” or “Terms” link which links to
www.openstreetmap.org/copyright or
www.opendatacommons.org/licenses/odbl.

9 https://wiki.osmfoundation.org/wiki/Licence/Licence_and_Legal_FAQ.


12.15.1 Getting and Mapping the Tweets 


Let’s interactively develop the code that plots tweet locations. We'll use utility functions 
from our tweetutilities.py file and class LocationListener in 
locationlistener.py. We'll explain the details of the utility functions and class in the 


subsequent sections. 


Get the API Object 


As in the other streaming examples, let’s authenticate with Twitter and get the Tweepy
API object. In this case, we do this via the get_API utility function in
tweetutilities.py:

In [1]: from tweetutilities import get_API

In [2]: api = get_API()


Collections Required By LocationListener 


Our LocationListener class requires two collections: a list (tweets) to store the
tweets we collect and a dictionary (counts) to track the total number of tweets we collect
and the number that have location data:


In [3]: tweets = []


In [4]: counts = {'total_tweets': 0, 'locations': 0}


Creating the LocationListener


For this example, the LocationListener will collect 50 tweets about 'football':


In [5]: from locationlistener import LocationListener


In [6]: location_listener = LocationListener(api, counts_dict=counts,
   ...:     tweets_list=tweets, topic='football', limit=50)


The LocationListener will use our utility function get_tweet_content to extract
the screen name, tweet text and location from each tweet, and place that data in a dictionary.


Configure and Start the Stream of Tweets 


Next, let’s set up our Stream to look for English language 'football' tweets:


In [7]: import tweepy


In [8]: stream = tweepy.Stream(auth=api.auth, listener=location_listener)


In [9]: stream.filter(track=['football'], languages=['en'], is_async=False)


Now wait to receive the tweets. Though we do not show them here (to save space), the
LocationListener displays each tweet’s screen name and text so you can see the live
stream. If you’re not receiving any (perhaps because it is not football season), you might
want to type Ctrl + C to terminate the previous snippet, then try again with a different
search term.


Displaying the Location Statistics 





When the next In [] prompt displays, we can check how many tweets we processed, how
many had locations and the percentage that had locations:


In [10]: counts['total_tweets']
Out[10]: 63


In [11]: counts['locations']
Out[11]: 50


In [12]: print(f'{counts["locations"] / counts["total_tweets"]:.1%}')
79.4%


In this particular execution, 79.4% of the tweets contained location data.
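The percentage comes from the f-string's .1% format specifier, which multiplies the ratio by 100, rounds it to one decimal place and appends a percent sign. A minimal sketch, assuming the run's 50 located tweets out of a total of 63 (the total implied by the 79.4% figure; your counts will differ):

```python
# Counts as they stood at the end of this particular run (assumed values).
counts = {'total_tweets': 63, 'locations': 50}

# Ratio of tweets with a location to all original tweets processed.
share = counts['locations'] / counts['total_tweets']

# .1% scales by 100, keeps one decimal place and adds the % sign.
formatted = f'{share:.1%}'
print(formatted)  # 79.4%
```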


Geocoding the Locations 


Now, let’s use our get_geocodes utility function from tweetutilities.py to
geocode the location of each tweet stored in the list tweets:


In [13]: from tweetutilities import get_geocodes


In [14]: bad_locations = get_geocodes(tweets)
Getting coordinates for tweet locations...
OpenMapQuest service timed out. Waiting.
OpenMapQuest service timed out. Waiting.
Done geocoding


Sometimes the OpenMapQuest geocoding service times out, meaning that it cannot
handle your request immediately and you need to try again. In that case, our function
get_geocodes displays a message, waits for a short time, then retries the geocoding
request.


As you’ll soon see, for each tweet with a valid location, the get_geocodes function adds
to the tweet’s dictionary in the tweets list two new keys—'latitude' and
'longitude'. For the corresponding values, the function uses the tweet’s coordinates
that OpenMapQuest returns.


Displaying the Bad Location Statistics 





When the next In [] prompt displays, we can check the percentage of tweets that had
invalid location data:


In [15]: bad_locations
Out[15]: 7


In [16]: print(f'{bad_locations / counts["locations"]:.1%}')
14.0%


In this case, of the 50 tweets with location data, 7 (14%) had invalid locations.


Cleaning the Data 


Before we plot the tweet locations on a map, let’s use a pandas DataFrame to clean the
data. When you create a DataFrame from the tweets list, it will contain the value NaN
for the 'latitude' and 'longitude' of any tweet that did not have a valid location.
We can remove any such rows by calling the DataFrame’s dropna method:


In [17]: import pandas as pd


In [18]: df = pd.DataFrame(tweets)


In [19]: df = df.dropna()


Creating a Map with Folium


Now, let’s create a folium Map on which we'll plot the tweet locations:


In [20]: import folium


In [21]: usmap = folium.Map(location=[39.8283, -98.5795],
    ...:                    tiles='Stamen Terrain',
    ...:                    zoom_start=5, detect_retina=True)





The location keyword argument specifies a sequence containing latitude and longitude
coordinates for the map’s center point. The values above are the geographic center of the
continental United States (http://bit.ly/CenterOfTheUS). It’s possible that some
of the tweets we plot will be outside the U.S. In this case, you will not see them initially
when you open the map. You can zoom in and out using the + and - buttons at the top-left
of the map, or you can pan the map by dragging it with the mouse to see anywhere in the
world.


The zoom_start keyword argument specifies the map’s initial zoom level: lower values
show more of the world and higher values show less. On our system, 5 displays the entire
continental United States. The detect_retina keyword argument enables folium to
detect high-resolution screens. When it does, it requests higher-resolution maps from
OpenStreetMap.org and changes the zoom level accordingly.


Creating Popup Markers for the Tweet Locations 





Next, let’s iterate through the DataFrame and add to the Map folium Popup objects
containing each tweet’s text. In this case, we'll use method itertuples to create tuples
from each row of the DataFrame. Each tuple will contain a property for each DataFrame
column:


In [22]: for t in df.itertuples():
    ...:     text = ': '.join([t.screen_name, t.text])
    ...:     popup = folium.Popup(text, parse_html=True)
    ...:     marker = folium.Marker((t.latitude, t.longitude),
    ...:                            popup=popup)
    ...:     marker.add_to(usmap)
    ...:


First, we create a string (text) containing the user’s screen name and tweet text
separated by a colon. This will be displayed on the map if you click the corresponding
marker. The second statement creates a folium Popup to display the text. The third
statement creates a folium Marker object using a tuple to specify the Marker’s latitude
and longitude. The popup keyword argument associates the tweet’s Popup object with the
new Marker. Finally, the last statement calls the Marker’s add_to method to specify
the Map that will display the Marker.
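To see what each loop iteration works with, without needing pandas or folium, the rows can be imitated with named tuples, which is the kind of object itertuples produces. This is only a sketch: the Row type, the helper functions popup_text and marker_position, and the sample values are invented for illustration.

```python
from collections import namedtuple

# Stand-in for the rows that DataFrame.itertuples() yields; the field
# names mirror the DataFrame's columns. The sample values are invented.
Row = namedtuple('Row',
                 ['Index', 'screen_name', 'text', 'latitude', 'longitude'])

rows = [
    Row(0, 'user_one', 'Great football game today!', 42.36, -71.06),
    Row(1, 'user_two', 'Football season is back', 34.05, -118.24),
]


def popup_text(row):
    """Return 'screen_name: text', the string shown in a marker's popup."""
    return ': '.join([row.screen_name, row.text])


def marker_position(row):
    """Return the (latitude, longitude) tuple that places the marker."""
    return (row.latitude, row.longitude)


for row in rows:
    print(marker_position(row), popup_text(row))
```

Because each row exposes its columns as attributes, the loop body can read t.screen_name or t.latitude directly rather than indexing by position.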


Saving the Map 


The last step is to call the Map’s save method to store the map in an HTML file, which you
can then double click to open in your web browser:


In [23]: usmap.save('tweet_map.html')


The resulting map follows. The Markers on your map will differ:


[Figure: the folium map of tweet locations, with one marker’s popup open showing
“CoachJoshFizel: @SpoonerFootball awesome stuff watch this! #FAMILY”]


Map data © OpenStreetMap contributors.


The data is available under the Open Database License
http://www.openstreetmap.org/copyright.


12.15.2 Utility Functions in tweetutilities.py 


Here we present the utility functions get_tweet_content and get_geocodes used
in the preceding section’s IPython session. In each case, the line numbers start from 1 for
discussion purposes. These are both defined in tweetutilities.py, which is included
in the ch12 examples folder.


get_tweet_content Utility Function


Function get_tweet_content receives a Status object (tweet) and creates a
dictionary containing the tweet’s screen_name (line 4), text (lines 7–10) and
location (lines 12–13). The location is included only if the location keyword
argument is True. For the tweet’s text, we try to use the full_text property of an
extended tweet. If it’s not available, we use the text property:


 1  def get_tweet_content(tweet, location=False):
 2      """Return dictionary with data from tweet (a Status object)."""
 3      fields = {}
 4      fields['screen_name'] = tweet.user.screen_name
 5
 6      # get the tweet's text
 7      try:
 8          fields['text'] = tweet.extended_tweet.full_text
 9      except:
10          fields['text'] = tweet.text
11
12      if location:
13          fields['location'] = tweet.user.location
14
15      return fields
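The try/except fallback for the tweet's text can be exercised without a Twitter connection. The sketch below uses types.SimpleNamespace objects as hypothetical stand-ins for Tweepy Status objects; the helper name tweet_text is invented here, and it catches AttributeError specifically, which is an assumption about how a missing extended tweet shows up on these stand-in objects.

```python
from types import SimpleNamespace


def tweet_text(tweet):
    """Prefer an extended tweet's full text; fall back to tweet.text."""
    try:
        return tweet.extended_tweet.full_text
    except AttributeError:  # this object has no extended_tweet
        return tweet.text


# Hypothetical stand-ins for Status objects (invented sample data)
short_tweet = SimpleNamespace(text='Go team!')
extended = SimpleNamespace(
    text='A long tweet that was trunc...',
    extended_tweet=SimpleNamespace(
        full_text='A long tweet that was truncated in the text field'))

print(tweet_text(short_tweet))
print(tweet_text(extended))
```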


get_geocodes Utility Function


Function get_geocodes receives a list of dictionaries containing tweets and geocodes
their locations. If geocoding is successful for a tweet, the function adds the latitude and
longitude to the tweet’s dictionary in tweet_list. This code requires class
OpenMapQuest from the geopy module, which we import into the file
tweetutilities.py as follows:


from geopy import OpenMapQuest


 1  def get_geocodes(tweet_list):
 2      """Get the latitude and longitude for each tweet's location.
 3      Returns the number of tweets with invalid location data."""
 4      print('Getting coordinates for tweet locations...')
 5      geo = OpenMapQuest(api_key=keys.mapquest_key)  # geocoder
 6      bad_locations = 0
 7
 8      for tweet in tweet_list:
 9          processed = False
10          delay = .1  # used if OpenMapQuest times out to delay next call
11          while not processed:
12              try:  # get coordinates for tweet['location']
13                  geo_location = geo.geocode(tweet['location'])
14                  processed = True
15              except:  # timed out, so wait before trying again
16                  print('OpenMapQuest service timed out. Waiting.')
17                  time.sleep(delay)
18                  delay += .1
19
20          if geo_location:
21              tweet['latitude'] = geo_location.latitude
22              tweet['longitude'] = geo_location.longitude
23          else:
24              bad_locations += 1  # tweet['location'] was invalid
25
26      print('Done geocoding')
27      return bad_locations





The function operates as follows:


• Line 5 creates the OpenMapQuest object we'll use to geocode locations. The api_key
keyword argument is loaded from the keys.py file you edited earlier.


• Line 6 initializes bad_locations, which we use to keep track of the number of invalid
locations in the tweet objects we collected.


• In the loop, lines 9–18 attempt to geocode the current tweet’s location. Sometimes the
OpenMapQuest geocoding service will time out, meaning that it’s temporarily
unavailable. This can happen if you make too many requests too quickly. So, the
while loop continues executing as long as processed is False. In each iteration, this
loop calls the OpenMapQuest object’s geocode method with the tweet’s location
string as an argument. If successful, processed is set to True and the loop terminates.
Otherwise, lines 16–18 display a time-out message, wait for delay seconds and
increase the delay in case we get another time out. Line 17 calls the Python Standard
Library time module’s sleep function to pause the code execution.


• After the while loop terminates, lines 20–24 check whether location data was
returned and, if so, add it to the tweet’s dictionary. Otherwise, line 24 increments the
bad_locations counter.


• Finally, the function prints a message that it’s done geocoding and returns the
bad_locations value.
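The retry-with-growing-delay pattern described above can be tried in isolation. In this sketch, FlakyGeocoder is a hypothetical stand-in for the real geocoder (it fails a fixed number of times, then succeeds), geocode_with_retry is an invented helper name, the delays are shortened so the example finishes quickly, and only TimeoutError is caught, which is an assumption made for the stand-in rather than a statement about the exceptions geopy raises.

```python
import time


class FlakyGeocoder:
    """Hypothetical geocoder that times out a set number of times."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def geocode(self, location):
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError('service timed out')
        # Invented lookup: one known place, everything else is invalid
        return (42.36, -71.06) if location == 'Boston, MA' else None


def geocode_with_retry(geo, location, first_delay=0.01, step=0.01):
    """Call geo.geocode until it returns, waiting longer after each failure."""
    delay = first_delay
    waits = []  # record of how long we slept before each retry
    while True:
        try:
            return geo.geocode(location), waits
        except TimeoutError:
            waits.append(delay)
            time.sleep(delay)
            delay += step  # back off a little more next time


geo = FlakyGeocoder(failures=2)
result, waits = geocode_with_retry(geo, 'Boston, MA')
print(result, len(waits))
```

Increasing the delay after every failure gives a busy service progressively more time to recover, instead of retrying at a fixed rate.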


12.15.3 Class LocationListener 


Class LocationListener performs many of the same tasks we demonstrated in the
prior streaming examples, so we'll focus on just a few lines in this class:


 1  # locationlistener.py
 2  """Receives tweets matching a search string and stores a list of
 3  dictionaries containing each tweet's screen_name/text/location."""
 4  import tweepy
 5  from tweetutilities import get_tweet_content
 6
 7  class LocationListener(tweepy.StreamListener):
 8      """Handles incoming Tweet stream to get location data."""
 9
10      def __init__(self, api, counts_dict, tweets_list, topic, limit=10):
11          """Configures the LocationListener."""
12          self.tweets_list = tweets_list
13          self.counts_dict = counts_dict
14          self.topic = topic
15          self.TWEET_LIMIT = limit
16          super().__init__(api)  # call superclass's init
17
18      def on_status(self, status):
19          """Called when Twitter pushes a new tweet to you."""
20          # get each tweet's screen_name, text and location
21          tweet_data = get_tweet_content(status, location=True)
22
23          # ignore retweets and tweets that do not contain the topic
24          if (tweet_data['text'].startswith('RT') or
25              self.topic.lower() not in tweet_data['text'].lower()):
26              return
27
28          self.counts_dict['total_tweets'] += 1  # original tweet
29
30          # ignore tweets with no location
31          if not status.user.location:
32              return
33
34          self.counts_dict['locations'] += 1  # tweet with location
35          self.tweets_list.append(tweet_data)  # store the tweet
36          print(f'{status.user.screen_name}: {tweet_data["text"]}\n')
37
38          # if TWEET_LIMIT is reached, return False to terminate streaming
39          return self.counts_dict['locations'] != self.TWEET_LIMIT











In this case, the __init__ method receives a counts dictionary that we use to keep
track of the total number of tweets processed and a tweets_list in which we store the
dictionaries returned by the get_tweet_content utility function.


Method on_status:


• Calls get_tweet_content to get the screen name, text and location of each tweet.


• Ignores the tweet if it is a retweet or if the text does not include the topic we’re
searching for—we'll use only original tweets containing the search string.


• Adds 1 to the value of the 'total_tweets' key in the counts dictionary to track the
number of original tweets we process.


• Ignores tweets that have no location data.


• Adds 1 to the value of the 'locations' key in the counts dictionary to indicate that
we found a tweet with a location.


• Appends to the tweets_list the tweet_data dictionary that
get_tweet_content returned.


• Displays the tweet’s screen name and tweet text so you can see that the app is making
progress.


• Checks whether the TWEET_LIMIT has been reached and, if so, returns False to
terminate the stream.
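The order of these checks determines what each counter means, and that can be verified without a live stream. The following sketch replays the same decisions over a handful of invented sample tweets; classify_tweet and its return labels are hypothetical names introduced for this illustration, not part of the class.

```python
def classify_tweet(text, location, topic):
    """Apply the same checks as on_status and say what happens to the tweet."""
    if text.startswith('RT') or topic.lower() not in text.lower():
        return 'skip'         # retweet or off-topic: not counted at all
    if not location:
        return 'no_location'  # counted as an original tweet, but not stored
    return 'keep'             # counted and stored


counts = {'total_tweets': 0, 'locations': 0}
kept = []
limit = 2  # plays the role of TWEET_LIMIT

samples = [  # invented (text, location) pairs
    ('RT @fan: Football tonight!', 'Boston'),
    ('Love FOOTBALL season', ''),
    ('Baseball is better', 'Chicago'),
    ('Football practice at noon', 'Dallas, TX'),
    ('Watching football with friends', 'Miami'),
    ('More football tomorrow', 'Denver'),
]

for text, location in samples:
    outcome = classify_tweet(text, location, 'football')
    if outcome == 'skip':
        continue
    counts['total_tweets'] += 1
    if outcome == 'no_location':
        continue
    counts['locations'] += 1
    kept.append(text)
    if counts['locations'] == limit:
        break  # where on_status would return False to stop the stream

print(counts, kept)
```

Note that 'total_tweets' counts original, on-topic tweets whether or not they have a location, which is why the location percentage in the session is computed against it.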


12.16 WAYS TO STORE TWEETS 


For analysis, you'll commonly store tweets in:


• CSV files—A file format that we introduced in the “Files and Exceptions” chapter.


• pandas DataFrames in memory—CSV files can be loaded easily into DataFrames for
cleaning and manipulation.


• SQL databases—Such as MySQL, a free and open source relational database
management system (RDBMS).


• NoSQL databases—Twitter returns tweets as JSON documents, so the natural way to
store them is in a NoSQL JSON document database, such as MongoDB. Tweepy
generally hides the JSON from the developer. If you’d like to manipulate the JSON
directly, use the techniques we present in the “Big Data: Hadoop, Spark, NoSQL and
IoT” chapter, where we'll look at the PyMongo library.
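As a small illustration of the first option, the dictionaries produced for each tweet map naturally onto CSV rows. This sketch uses the standard library's csv module with an in-memory buffer standing in for a file; the sample tweets are invented, and the column names are assumed to match the dictionary keys used earlier in this section.

```python
import csv
import io

tweets = [  # invented sample data shaped like the tweet dictionaries
    {'screen_name': 'user_one', 'text': 'Great football game, today!',
     'location': 'Boston, MA'},
    {'screen_name': 'user_two', 'text': 'Football season is back',
     'location': 'Dallas, TX'},
]

# StringIO stands in for open('tweets.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.DictWriter(buffer,
                        fieldnames=['screen_name', 'text', 'location'])
writer.writeheader()      # first row holds the column names
writer.writerows(tweets)  # values containing commas are quoted for us

# Read the rows back to confirm nothing was lost
buffer.seek(0)
restored = [dict(row) for row in csv.DictReader(buffer)]
print(restored[0]['location'])
```

A file written this way can then be loaded into a DataFrame for cleaning, which connects the first two options in the list.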


12.17 TWITTER AND TIME SERIES 


A time series is a sequence of values with timestamps. Some examples are daily closing
stock prices, daily high temperatures at a given location, monthly U.S. job-creation
numbers, quarterly earnings for a given company and more. Tweets are natural for
time-series analysis because they’re time stamped. In the “Machine Learning” chapter, we'll use
a technique called simple linear regression to make predictions with time series. We'll
take another look at time series in the “Deep Learning” chapter when we discuss recurrent
neural networks.
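One simple way to turn timestamped tweets into a time series is to count how many fall into each hour. The sketch below assumes you have each tweet's creation time as a datetime object; the timestamps are invented, and hourly bucketing is just one possible choice of interval.

```python
from collections import Counter
from datetime import datetime

created_at = [  # invented tweet timestamps
    datetime(2018, 9, 1, 13, 5),
    datetime(2018, 9, 1, 13, 40),
    datetime(2018, 9, 1, 14, 2),
    datetime(2018, 9, 1, 16, 30),
    datetime(2018, 9, 1, 16, 45),
]

# Truncate each timestamp to the start of its hour, then count per hour
per_hour = Counter(ts.replace(minute=0, second=0, microsecond=0)
                   for ts in created_at)

# Sorting by timestamp yields the series: (hour, number of tweets)
series = sorted(per_hour.items())
for hour, count in series:
    print(hour.isoformat(), count)
```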


12.18 WRAP-UP 


In this chapter, we explored data mining Twitter, perhaps the most open and accessible of
all the social media sites, and one of the most commonly used big-data sources. You
created a Twitter developer account and connected to Twitter using your account
credentials. We discussed Twitter’s rate limits and some additional rules, and the
importance of conforming to them.


We looked at the JSON representation of a tweet. We used Tweepy—one of the most
widely used Twitter API clients—to authenticate with Twitter and access its APIs. We saw
that tweets returned by the Twitter APIs contain much metadata in addition to a tweet’s
text. We determined an account’s followers and whom an account follows, and looked at a
user’s recent tweets.


We used Tweepy Cursors to conveniently request successive pages of results from
various Twitter APIs. We used Twitter’s Search API to download past tweets that met
specified criteria. We used Twitter’s Streaming API to tap into the flow of live tweets as
they happened. We used the Twitter Trends API to determine trending topics for various
locations and created a word cloud from trending topics.


We used the tweet-preprocessor library to clean and preprocess tweets to prepare them
for analysis, and performed sentiment analysis on tweets. We used the folium library to
create a map of tweet locations and interacted with it to see the tweets at particular
locations. We enumerated common ways to store tweets and noted that tweets are a
natural form of time series data. In the next chapter, we’ll present IBM’s Watson and its
cognitive computing capabilities.




13. IBM Watson and Cognitive Computing 


Objectives 
In this chapter, you'll: 


• See Watson’s range of services and use their Lite tier to become familiar with them at
no charge.
• Try lots of demos of Watson services.
• Understand what cognitive computing is and how you can incorporate it into your
applications.
• Register for an IBM Cloud account and get credentials to use various services.
• Install the Watson Developer Cloud Python SDK to interact with Watson services.
• Develop a traveler’s companion language translator app by using Python to weave
together a mashup of the Watson Speech to Text, Language Translator and Text to
Speech services.
• Check out additional resources, such as IBM Watson Redbooks that will help you
jump start your custom Watson application development.


Outline


13.1 Introduction: IBM Watson and Cognitive Computing
13.2 IBM Cloud Account and Cloud Console
13.3 Watson Services
13.4 Additional Services and Tools
13.5 Watson Developer Cloud Python SDK
13.6 Case Study: Traveler’s Companion Translation App
13.6.1 Before You Run the App
13.6.2 Test-Driving the App
13.6.3 SimpleLanguageTranslator.py Script Walkthrough
13.7 Watson Resources
13.8 Wrap-Up


13.1 INTRODUCTION: IBM WATSON AND COGNITIVE 
COMPUTING 


In Chapter 1, we discussed some key IBM artificial-intelligence accomplishments,
including beating the two best human Jeopardy! players in a $1 million match. Watson
won the competition and IBM donated the prize money to charity. Watson
simultaneously executed hundreds of language-analysis algorithms to locate correct
answers in 200 million pages of content (including all of Wikipedia) requiring four
terabytes of storage. 1, 2 IBM researchers trained Watson using machine-learning and
reinforcement-learning techniques—we discuss machine learning in the next chapter. 3


1 https://www.techrepublic.com/article/ibm-watson-the-inside-story-of-how-the-jeopardy--winning-supercomputer-was-born-and-what-it-wants-to-do-next/.


2 https://en.wikipedia.org/wiki/Watson_(computer).


3 https://www.aaai.org/Magazine/Watson/watson.php, AI Magazine, Fall 2010.


Early in our research for this book, we recognized the rapidly growing importance of
Watson, so we placed Google Alerts on Watson and related topics. Through those alerts
and the newsletters and blogs we follow, we accumulated 900+ current Watson-related
articles, documentation pieces and videos. We investigated many competitive services
and found Watson’s “no credit card required” policy and free Lite tier services 4 to be
among the friendliest to people who'd like to experiment with Watson’s services at no
charge.


4 Always check the latest terms on IBM’s website, as the terms and services may change.


IBM Watson is a cloud-based cognitive-computing platform being employed across a
wide range of real-world scenarios. Cognitive-computing systems simulate the
pattern-recognition and decision-making capabilities of the human brain to “learn” as they
consume more data. 5, 6, 7 We overview Watson’s broad range of web services and
provide a hands-on Watson treatment, demonstrating many Watson capabilities. The
table below shows just a few of the ways in which organizations are using
Watson.


5 http://whatis.techtarget.com/definition/cognitive-computing.


6 https://en.wikipedia.org/wiki/Cognitive_computing.


7 https://www.forbes.com/sites/bernardmarr/2016/03/23/what-everyone-should-know-about-cognitive-computing.


Watson offers an intriguing set of capabilities that you can incorporate into your
applications. In this chapter, you'll set up an IBM Cloud account 8 and use the Lite tier
and IBM’s Watson demos to experiment with various web services, such as natural
language translation, speech-to-text, text-to-speech, natural language understanding,
chatbots, analyzing text for tone and visual object recognition in images and video.
We'll briefly overview some additional Watson services and tools.


8 IBM Cloud previously was called Bluemix. You’ll still see bluemix in many of this
chapter’s URLs.


Watson use cases


ad targeting                 fraud prevention                  personal assistants
artificial intelligence      game playing                      predictive maintenance
augmented intelligence       genetics                          product recommendations
augmented reality            healthcare                        robots and drones
chatbots                     image processing                  self-driving cars
closed captioning            IoT (Internet of Things)          sentiment and mood analysis
cognitive computing          language translation              smart homes
conversational interfaces    machine learning                  sports
crime prevention             malware detection                 supply-chain management
customer support             medical diagnosis and treatment   threat detection
detecting cyberbullying      medical imaging                   virtual reality
drug development             music                             voice analysis
education                    natural language processing       weather forecasting
facial recognition           natural language understanding    workplace safety
finance                      object recognition


You'll install the Watson Developer Cloud Python Software Development Kit (SDK) for 
programmatic access to Watson services from your Python code. Then, in our hands-on 
implementation case study, you'll develop a traveler’s companion translation app by 
quickly and conveniently mashing up several Watson services. The app enables 
English-only and Spanish-only speakers to communicate with one another verbally, 
despite the language barrier. You'll transcribe English and Spanish audio recordings to 
text, translate the text to the other language, then synthesize and play English and 
Spanish audio from the translated text. 


Watson is a dynamic and evolving set of capabilities. During the time we worked on this
book, new services were added and existing services were updated and/or removed
multiple times. The descriptions of the Watson services and the steps we present were
accurate as of the time of this writing. We'll post updates as necessary on the book’s
web page at www.deitel.com.


13.2 IBM CLOUD ACCOUNT AND CLOUD CONSOLE


You'll need a free IBM Cloud account to access Watson’s Lite tier services. Each
service’s description web page lists the service’s tiered offerings and what you get with
each tier. Though the Lite tier services limit your use, they typically offer what you'll
need to familiarize yourself with Watson features and begin using them to develop
apps. The limits are subject to change, so rather than list them here, we point you to
each service’s web page. IBM increased the limits significantly on some services while
we were writing this book. Paid tiers are available for use in commercial-grade
applications.


To get a free IBM Cloud account, follow the instructions at:


https://console.bluemix.net/docs/services/watson/index.html#about


You'll receive an e-mail. Follow its instructions to confirm your account. Then you can
log in to the IBM Cloud console. Once there, you can go to the Watson dashboard at:


https://console.bluemix.net/developer/watson/dashboard


where you can:


• Browse the Watson services.


• Link to the services you’ve already registered to use.


• Look at the developer resources, including the Watson documentation, SDKs and
various resources for learning Watson.


• View the apps you've created with Watson.


Later, you'll register for and get your credentials to use various Watson services. You
can view and manage your list of services and your credentials in the IBM Cloud
dashboard at:


https://console.bluemix.net/dashboard/apps


You can also click Existing Services in the Watson dashboard to get to this list.


13.3 WATSON SERVICES 


This section overviews many of Watson’s services and provides links to the details for
each. Be sure to run the demos to see the services in action. For links to each Watson
service’s documentation and API reference, visit:


https://console.bluemix.net/developer/watson/documentation


We provide footnotes with links to each service’s details. When you’re ready to use a
particular service, click the Create button on its details page to set up your credentials.


Watson Assistant 


The Watson Assistant service 9 helps you build chatbots and virtual assistants that
enable users to interact via natural language text. IBM provides a web interface that you
can use to train the Watson Assistant service for specific scenarios associated with your
app. For example, a weather chatbot could be trained to respond to questions like,
“What is the weather forecast for New York City?” In a customer service scenario, you
could create chatbots that answer customer questions and route customers to the
correct department, if necessary. Try the demo at the following site to see some sample
interactions:


https://www.ibm.com/watson/services/conversation/demo/index.html#demo


9 https://console.bluemix.net/catalog/services/watson-assistant-formerly-conversation.


Visual Recognition 


The Visual Recognition service 10 enables apps to locate and understand
information in images and video, including colors, objects, faces, text, food and
inappropriate content. IBM provides predefined models (used in the service’s demo), or
you can train and use your own (as you'll do in the “Deep Learning” chapter). Try the
following demo with the images provided and upload some of your own:


https://watson-visual-recognition-duo-dev.ng.bluemix.net/


10 https://console.bluemix.net/catalog/services/visual-recognition.


Speech to Text 


The Speech to Text service, 11 which we'll use in building this chapter’s app, converts
speech audio files to text transcriptions of the audio. You can give the service keywords
to “listen” for, and it tells you whether it found them, what the likelihood of a match
was and where the match occurred in the audio. The service can distinguish among
multiple speakers. You could use this service to help implement voice-controlled apps,
transcribe live audio and more. Try the following demo with its sample audio clips or
upload your own:


https://speech-to-text-demo.ng.bluemix.net/


11 https://console.bluemix.net/catalog/services/speech-to-text.


Text to Speech 


The Text to Speech service, 12 which we'll also use in building this chapter’s app,
enables you to synthesize speech from text. You can use Speech Synthesis Markup
Language (SSML) to embed instructions in the text for control over voice inflection,
cadence, pitch and more. Currently, this service supports English (U.S. and U.K.),
French, German, Italian, Spanish, Portuguese and Japanese. Try the following demo
with its plain sample text, its sample text that includes SSML and text that you provide:


https://text-to-speech-demo.ng.bluemix.net/


12 https://console.bluemix.net/catalog/services/text-to-speech.


Language Translator 


The Language Translator service, 13 which we'll also use in building this chapter’s
app, has two key components:


• translating text between languages and


• identifying text as being written in one of over 60 languages.


Translation is supported to and from English and many languages, as well as between
other languages. Try translating text into various languages with the following demo:


https://language-translator-demo.ng.bluemix.net/


13 https://console.bluemix.net/catalog/services/language-translator.


Natural Language Understanding 


The Natural Language Understanding service 14 analyzes text and produces
information including the text’s overall sentiment and emotion and keywords ranked
by their relevance. Among other things, the service can identify


• people, places, job titles, organizations, companies and quantities.


• categories and concepts like sports, government and politics.


• parts of speech like subjects and verbs.


You also can train the service for industry- and application-specific domains with
Watson Knowledge Studio (discussed shortly). Try the following demo with its sample
text, with text that you paste in or by providing a link to an article or document online:


https://natural-language-understanding-demo.ng.bluemix.net/


14 https://console.bluemix.net/catalog/services/natural-language-understanding.


Discovery 


The Watson Discovery service 15 shares many features with the Natural Language
Understanding service but also enables enterprises to store and manage documents. So,
for example, organizations can use Watson Discovery to store all their text documents
and be able to use natural language understanding across the entire collection. Try this
service’s demo, which enables you to search recent news articles for companies:


https://discovery-news-demo.ng.bluemix.net/


15 https://console.bluemix.net/catalog/services/discovery.


Personality Insights 


The Personality Insights service 16 analyzes text for personality traits. According to
the service description, it can help you “gain insight into how and why people think, act,
and feel the way they do. This service applies linguistic analytics and personality theory
to infer attributes from a person’s unstructured text.” This information could be used to
target product advertising at the people most likely to purchase those products. Try the
following demo with tweets from various Twitter accounts or documents built into the
demo, with text documents that you paste into the demo or with your own Twitter
account:


https://personality-insights-livedemo.ng.bluemix.net/


16 https://console.bluemix.net/catalog/services/personality-insights.


Tone Analyzer 


The Tone Analyzer service 17 analyzes text for its tone in three categories:


• emotions—anger, disgust, fear, joy, sadness.


• social propensities—openness, conscientiousness, extroversion, agreeableness and
emotional range.


• language style—analytical, confident, tentative.


Try the following demo with sample tweets, a sample product review, a sample e-mail
or text you provide. You'll see the tone analyses at both the document and sentence
levels:


https://tone-analyzer-demo.ng.bluemix.net/


17 https://console.bluemix.net/catalog/services/tone-analyzer.


Natural Language Classifier 


You train the Natural Language Classifier service 8 with sentences and phrases 
that are specific to your application and classify each sentence or phrase. For example, 
you might classify “I need help with your product” as “tech support” and “My bill is 
incorrect” as “billing.” Once you’ve trained your classifier, the service can receive 
sentences and phrases, then use Watson’s cognitive computing capabilities and your 
classifier to return the best matching classifications and their match probabilities. You 
might then use the returned classifications and probabilities to determine the next 
steps in your app. For example, in a customer service app where someone is calling in 
with a question about a particular product, you might use Speech to Text to convert a 
question into text, use the Natural Language Classifier service to classify the text, then 
route the call to the appropriate person or department. This service does not offer a 
Lite tier. In the following demo, enter a question about the weather. The service will 
respond by indicating whether your question was about the temperature or the weather 
conditions: 

8 https://console.bluemix.net/catalog/services/natural-language-classifier. 

https://natural-language-classifier-demo.ng.bluemix.net/ 
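
The routing idea described above can be sketched without calling Watson at all. The following is a minimal illustration, assuming you have already extracted the returned classifications into a list of (class name, probability) tuples. The department names, the 0.6 threshold and the function route_call are hypothetical and are not part of the Watson API.

```python
# Hypothetical mapping from classifier classes to departments (an assumption,
# not something the Natural Language Classifier service defines).
ROUTES = {'tech support': 'Technical Support Desk',
          'billing': 'Billing Department'}


def route_call(classifications, threshold=0.6, default='General Help Desk'):
    """Return the department for the most probable classification.

    classifications is assumed to be a list of (class_name, probability)
    tuples that you built from the service's response.
    """
    if not classifications:
        return default

    # pick the classification with the highest match probability
    best_class, best_probability = max(
        classifications, key=lambda pair: pair[1])

    if best_probability < threshold:
        return default  # not confident enough to route automatically

    return ROUTES.get(best_class, default)


print(route_call([('billing', 0.91), ('tech support', 0.09)]))
```

Falling back to a default destination when the best probability is low keeps an uncertain classification from sending a caller to the wrong department.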


Synchronous and Asynchronous Capabilities 


Many of the APIs we discuss throughout the book are synchronous—when you call a 
function or method, the program waits for the function or method to return before 
moving on to the next task. Asynchronous programs can start a task, continue doing 
other things, then be notified when the original task completes and returns its results. 


Many Watson services offer both synchronous and asynchronous APIs. 


The Speech to Text demo is a good example of asynchronous APIs. The demo processes 
sample audio of two people speaking. As the service transcribes the audio, it returns 
intermediate transcription results, even if it has not yet been able to distinguish among 
the speakers. The demo displays these intermediate results in parallel with the service’s 
continued work. Sometimes the demo displays “Detecting speakers” while the service 
figures out who is speaking. Eventually, the service sends updated transcription results 
for distinguishing among the speakers, and the demo then replaces the prior 


transcription results. 


With today’s multi-core computers and multi-computer clusters, the asynchronous 
APIs can help you improve program performance. However, programming with them 
can be more complicated than programming with synchronous APIs. When we discuss 
installing the Watson Developer Cloud Python SDK, we provide a link to the SDK’s code 
examples on GitHub, where you can see examples that use synchronous and 
asynchronous versions of several services. Each service’s API reference provides the 


complete details. 
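
To make the synchronous/asynchronous distinction concrete, here is a small sketch that uses only the Python Standard Library's asyncio module. It does not use the Watson SDK. The function simulated_service_call is a hypothetical stand-in that merely sleeps to imitate a slow web-service request.

```python
import asyncio
import time


async def simulated_service_call(name, seconds):
    """Stand-in for a slow web-service request (no real network access)."""
    await asyncio.sleep(seconds)  # yields control while "waiting"
    return f'{name} finished'


async def run_concurrently():
    """Start both tasks, then wait until both have completed."""
    return await asyncio.gather(
        simulated_service_call('transcription', 0.2),
        simulated_service_call('translation', 0.2))


start = time.perf_counter()
results = asyncio.run(run_concurrently())
elapsed = time.perf_counter() - start

print(results)
print(f'elapsed: {elapsed:.2f} seconds')
```

Because both simulated calls wait at the same time, the elapsed time should be close to the longer of the two waits rather than their sum, which is the performance benefit the text describes.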


13.4 ADDITIONAL SERVICES AND TOOLS 


In this section, we overview several Watson advanced services and tools. 


Watson Studio 


Watson Studio is the new Watson interface for creating and managing your Watson 
projects and for collaborating with your team members on those projects. You can add 
data, prepare your data for analysis, create Jupyter Notebooks for interacting with your 
data, create and train models and work with Watson’s deep-learning capabilities. 
Watson Studio offers a single-user Lite tier. Once you’ve set up your Watson Studio Lite 
access by clicking Create on the service’s details web page 

https://console.bluemix.net/catalog/services/data-science-experience 

you can access Watson Studio at 

https://dataplatform.cloud.ibm.com/ 

Watson Studio contains preconfigured projects. Click Create a project to view 
them: 


• Standard: “Work with any type of asset. Add services for analytical assets as you 
need them.” 

• Data Science: “Analyze data to discover insights and share your findings with 
others.” 

• Visual Recognition: “Tag and classify visual content using the Watson Visual 
Recognition service.” 

• Deep Learning: “Build neural networks and deploy deep learning models.” 

• Modeler: “Build modeler flows to train SPSS models or design deep neural 
networks.” 

• Business Analytics: “Create visual dashboards from your data to gain insights 
faster.” 

• Data Engineering: “Combine, cleanse, analyze, and shape data using Data 
Refinery.” 

• Streams Flow: “Ingest and analyze streaming data using the Streaming Analytics 
service.” 


Knowledge Studio 


Various Watson services work with predefined models, but also allow you to provide 
custom models that are trained for specific industries or applications. Watson’s 
Knowledge Studio helps you build custom models. It allows enterprise teams to 
work together to create and train new models, which can then be deployed for use by 
Watson services. 

https://console.bluemix.net/catalog/services/knowledge-studio. 


Machine Learning 


The Watson Machine Learning service enables you to add predictive capabilities 
to your apps via popular machine-learning frameworks, including TensorFlow, Keras, 
scikit-learn and others. You'll use scikit-learn and Keras in the next two chapters. 

2 https://console.bluemix.net/catalog/services/machine-learning. 


Knowledge Catalog 


The Watson Knowledge Catalog is an advanced enterprise-level tool for 
securely managing, finding and sharing your organization’s data. The tool offers: 

3 https://medium.com/ibm-watson/introducing-ibm-watson-knowledge-catalog-cf42c13032c1. 

4 https://dataplatform.cloud.ibm.com/docs/content/catalog/overview-wkc.html. 

• Central access to an enterprise’s local and cloud-based data and machine learning 
models. 

• Watson Studio support so users can find and access data, then easily use it in 
machine-learning projects. 

• Security policies that ensure only the people who should have access to specific data 
actually do. 

• Support for over 100 data cleaning and wrangling operations. 

• And more. 


Cognos Analytics 


The IBM Cognos Analytics service, which has a 30-day free trial, uses AI and 
machine learning to discover and visualize information in your data, without any 
programming on your part. It also provides a natural-language interface that enables 
you to ask questions, which Cognos Analytics answers based on the knowledge it gathers 
from your data. 

3 https://www.ibm.com/products/cognos-analytics. 


13.5 WATSON DEVELOPER CLOUD PYTHON SDK 


In this section, you'll install the modules required for the next section’s 
full-implementation Watson case study. For your coding convenience, IBM provides the 
Watson Developer Cloud Python SDK (software development kit). Its 
watson_developer_cloud module contains classes that you'll use to interact with 
Watson services. You'll create objects for each service you need, then interact with the 
service by calling the object’s methods. 

To install the SDK, open an Anaconda Prompt (Windows; open as Administrator), 
Terminal (macOS/Linux) or shell (Linux), then execute the following command 7: 

For detailed installation instructions and troubleshooting tips, see 
https://github.com/watson-developer-cloud/python-sdk/blob/develop/README.md. 

7 Windows users might need to install Microsoft’s C++ build tools from 
https://visualstudio.microsoft.com/visual-cpp-build-tools/, then 
install the watson-developer-cloud module. 

pip install --upgrade watson-developer-cloud 


Modules We’ll Need for Audio Recording and Playback 


You'll also need two additional modules for audio recording (PyAudio) and playback 
(PyDub). To install these, use the following commands 8: 

8 Mac users might need to first execute conda install -c conda-forge 
portaudio. 

pip install pyaudio 
pip install pydub 


SDK Examples 


On GitHub, IBM provides sample code demonstrating how to access Watson services 


using the Watson Developer Cloud Python SDK’s classes. You can find the examples at: 


https://github.com/watson-developer-cloud/python-sdk/tree/master/examples 


13.6 CASE STUDY: TRAVELER’S COMPANION 
TRANSLATION APP 


Suppose you're traveling in a Spanish-speaking country, but you do not speak Spanish, 
and you need to communicate with someone who does not speak English. You could 
use a translation app to speak in English, and the app could translate that, then speak it 
in Spanish. The Spanish-speaking person could then respond, and the app could 
translate that and speak it to you in English. 


Here, you'll use three powerful IBM Watson services to implement such a traveler’s 
companion translation app, enabling people who speak different languages to 
converse in near real time. Combining services like this is known as creating a 
mashup. This app also uses simple file-processing capabilities that we introduced in 
the “Files and Exceptions” chapter. 

These services could change in the future. If they do, we’ll post updates on the book’s 
web page at http://www.deitel.com/books/IntroToPython. 


13.6.1 Before You Run the App 


You'll build this app using the Lite (free) tiers of several IBM Watson services. Before 
executing the app, make sure that you’ve registered for an IBM Cloud account, as we 
discussed earlier in the chapter, so you can get credentials for each of the three services 
the app uses. Once you have your credentials (described below), you'll insert them in 
our keys.py file (located in the ch13 examples folder) that we import into the 
example. Never share your credentials. 

As you configure the services below, each service’s credentials page also shows you the 
service’s URL. These are the default URLs used by the Watson Developer Cloud Python 
SDK, so you do not need to copy them. In Section 13.6.3, we present the 
SimpleLanguageTranslator.py script and a detailed walkthrough of the code. 


Registering for the Speech to Text Service 


This app uses the Watson Speech to Text service to transcribe English and Spanish 
audio files to English and Spanish text, respectively. To interact with the service, you 
must get an API key. To do so: 

1. Create a Service Instance: Go to 
https://console.bluemix.net/catalog/services/speech-to-text 
and click the Create button on the bottom of the page. This auto-generates an API 
key for you and takes you to a tutorial for working with the Speech to Text service. 

2. Get Your Service Credentials: To see your API key, click Manage at the top-left 
of the page. To the right of Credentials, click Show credentials, then copy 
the API Key and paste it into the variable speech_to_text_key’s string in the 
keys.py file provided in this chapter’s ch13 examples folder. 


Registering for the Text to Speech Service 


In this app, you'll use the Watson Text to Speech service to synthesize speech from text. 
This service also requires you to get an API key. To do so: 

1. Create a Service Instance: Go to 
https://console.bluemix.net/catalog/services/text-to-speech 
and click the Create button on the bottom of the page. This auto-generates an API 
key for you and takes you to a tutorial for working with the Text to Speech service. 

2. Get Your Service Credentials: To see your API key, click Manage at the top-left 
of the page. To the right of Credentials, click Show credentials, then copy 
the API Key and paste it into the variable text_to_speech_key’s string in the 
keys.py file provided in this chapter’s ch13 examples folder. 


Registering for the Language Translator Service 


In this app, you'll use the Watson Language Translator service to pass text to Watson 
and receive back the text translated into another language. This service requires you to 
get an API key. To do so: 

1. Create a Service Instance: Go to 
https://console.bluemix.net/catalog/services/language-translator 
and click the Create button on the bottom of the page. This auto-generates 
an API key for you and takes you to a page to manage your instance of the 
service. 

2. Get Your Service Credentials: To the right of Credentials, click Show 
credentials, then copy the API Key and paste it into the variable 
translate_key’s string in the keys.py file provided in this chapter’s ch13 
examples folder. 


Retrieving Your Credentials 


To view your credentials at any time, click the appropriate service instance at: 


https://console.bluemix.net/dashboard/apps 
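
As a rough sketch of what the credentials file holds, the following shows three string variables with the names used in this section (speech_to_text_key, text_to_speech_key and translate_key), plus a small helper that reports which keys have not been filled in yet. The placeholder text and the helper missing_keys are assumptions added for illustration. They are not part of the book's keys.py or of the Watson SDK.

```python
# Sketch of keys.py: each variable holds one service's API key as a string.
# The placeholder text is hypothetical; replace it with your own keys.
PLACEHOLDER = 'YOUR KEY HERE'

speech_to_text_key = PLACEHOLDER
text_to_speech_key = PLACEHOLDER
translate_key = PLACEHOLDER


def missing_keys(**named_keys):
    """Return the sorted names of keys that are empty or still placeholders."""
    return sorted(name for name, value in named_keys.items()
                  if not value or value == PLACEHOLDER)


# report any keys that still need to be pasted in
print(missing_keys(speech_to_text_key=speech_to_text_key,
                   text_to_speech_key=text_to_speech_key,
                   translate_key=translate_key))
```

Running a check like this before contacting any service gives a clearer message than an authentication failure from the service itself.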


13.6.2 Test-Driving the App 


Once you’ve added your credentials to keys.py, open an Anaconda Prompt 
(Windows), a Terminal (macOS/Linux) or a shell (Linux). Run the script by executing 
the following command from the ch13 examples folder: 

The pydub.playback module we use in this app issues a warning when you run our 
script. The warning has to do with module features we don’t use and can be ignored. To 
eliminate this warning, you can install ffmpeg for Windows, macOS or Linux from 
https://www.ffmpeg.org. 

ipython SimpleLanguageTranslator.py 


Processing the Question 


The app performs 10 steps, which we point out via comments in the code. When the 


app begins executing: 


Step 1 prompts for and records a question. First, the app displays: 








Press Enter then ask your question in English 





and waits for you to press Enter. When you do, the app displays: 


Recording 5 seconds of audio 


Speak your question. We said, “Where is the closest bathroom?” After five seconds, the 


app displays: 


Recording complete 


Step 2 interacts with Watson’s Speech to Text service to transcribe your audio to text 


and displays the result: 





English: where is the closest bathroom 


Step 3 then uses Watson’s Language Translator service to translate the English text to 


Spanish and displays the translated text returned by Watson: 


Spanish: ¿Dónde está el baño más cercano? 


Step 4 passes this Spanish text to Watson’s Text to Speech service to convert the text to 


an audio file. 


Step 5 plays the resulting Spanish audio file. 


Processing the Response 


At this point, we’re ready to process the Spanish speaker’s response. 


Step 6 displays: 





Press Enter then speak the Spanish answer 





and waits for you to press Enter. When you do, the app displays: 


Recording 5 seconds of audio 


and the Spanish speaker records a response. We do not speak Spanish, so we used 
Watson’s Text to Speech service to prerecord Watson saying the Spanish response “El 
baño más cercano está en el restaurante,” then played that audio loud enough for our 
computer’s microphone to record it. We provided this prerecorded audio for you as 
SpokenResponse.wav in the ch13 folder. If you use this file, play it quickly after 
pressing Enter above as the app records for only 5 seconds. To ensure that the audio 
loads and plays quickly, you might want to play it once before you press Enter to begin 
recording. After five seconds, the app displays: 

For simplicity, we set the app to record five seconds of audio. You can control the 
duration with the variable SECONDS in function record_audio. It’s possible to create 
a recorder that begins recording once it detects sound and stops recording after a 
period of silence, but the code is more complicated. 


Recording complete 


Step 7 interacts with Watson’s Speech to Text service to transcribe the Spanish audio 


to text and displays the result: 


Spanish response: el baño más cercano está en el restaurante 





Step 8 then uses Watson’s Language Translator service to translate the Spanish text to 


English and displays the result: 








English response: The nearest bathroom is in the restaurant 


Step 9 passes the English text to Watson’s Text to Speech service to convert the text to 
an audio file. 


Step 10 then plays the resulting English audio. 


13.6.3 SimpleLanguageTranslator.py Script Walkthrough 


In this section, we present the SimpleLanguageTranslator.py script’s source 
code, which we’ve divided into small consecutively numbered pieces. Let’s use a top- 


down approach as we did in the “Control Statements” chapter. Here’s the top: 

Create a translator app that enables English and Spanish speakers to communicate. 
The first refinement is: 

Translate a question spoken in English into Spanish speech. 

Translate the answer spoken in Spanish into English speech. 

We can break the first line of the second refinement into five steps: 

Step 1: Prompt for then record English speech into an audio file. 

Step 2: Transcribe the English speech to English text. 

Step 3: Translate the English text into Spanish text. 

Step 4: Synthesize the Spanish text into Spanish speech and save it into an audio file. 


Step 5: Play the Spanish audio file. 


We can break the second line of the second refinement into five steps: 
Step 6: Prompt for then record Spanish speech into an audio file. 
Step 7: Transcribe the Spanish speech to Spanish text. 
Step 8: Translate the Spanish text into English text. 
Step 9: Synthesize the English text into English speech and save it into an audio file. 
Step 10: Play the English audio. 


This top-down development makes the benefits of the divide-and-conquer approach 


clear, focusing our attention on small pieces of a more significant problem. 


In this section’s script, we implement the 10 steps specified in the second refinement. 
Steps 2 and 7 use the Watson Speech to Text service, Steps 3 and 8 use the Watson 


Language Translator service, and Steps 4 and 9 use the Watson Text to Speech service. 


Importing Watson SDK Classes 


Lines 4-6 import classes from the watson_developer_cloud module that was 
installed with the Watson Developer Cloud Python SDK. Each of these classes uses the 
Watson credentials you obtained earlier to interact with a corresponding Watson 
service: 

• Class SpeechToTextV1 enables you to pass an audio file to the Watson Speech to 
Text service and receive a JSON document containing the text transcription. 

The V1 in the class name indicates the service’s version number. As IBM revises its 
services, it adds new classes to the watson_developer_cloud module, rather 
than modifying the existing classes. This ensures that existing apps do not break 
when the services are updated. The Speech to Text and Text to Speech services are 
each Version 1 (V1) and the Language Translator service is Version 3 (V3) at the 
time of this writing. 

3 We introduced JSON in the previous chapter, “Data Mining Twitter.” 

• Class LanguageTranslatorV3 enables you to pass text to the Watson Language 
Translator service and receive a JSON document containing the translated text. 

• Class TextToSpeechV1 enables you to pass text to the Watson Text to Speech 
service and receive audio of the text spoken in a specified language. 


1 # SimpleLanguageTranslator.py 
2 """Use IBM Watson Speech to Text, Language Translator and Text to Speech 
3    APIs to enable English and Spanish speakers to communicate.""" 
4 from watson_developer_cloud import SpeechToTextV1 
5 from watson_developer_cloud import LanguageTranslatorV3 
6 from watson_developer_cloud import TextToSpeechV1 











Other Imported Modules 


Line 7 imports the keys.py file containing your Watson credentials. Lines 8-11 import 
modules that support this app’s audio-processing capabilities: 

• The pyaudio module enables us to record audio from the microphone. 

• The pydub and pydub.playback modules enable us to load and play audio files. 

• The Python Standard Library’s wave module enables us to save WAV (Waveform 
Audio File Format) files. WAV is a popular audio format originally developed by 
Microsoft and IBM. This app uses the wave module to save the recorded audio to a 
.wav file that we send to Watson’s Speech to Text service for transcription. 


7 import keys  # contains your API keys for accessing Watson services 
8 import pyaudio  # used to record from mic 
9 import pydub  # used to load a WAV file 
10 import pydub.playback  # used to play a WAV file 
11 import wave  # used to save a WAV file 
12 
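
To show the wave module on its own, here is a minimal sketch that writes a WAV file and reads its properties back. It does not record from a microphone. Instead it generates a one-second sine tone, so it needs no audio hardware or third-party modules. The function names save_tone and describe_wav, the tone's frequency and the temporary file location are assumptions chosen for illustration. The 44,100 frames-per-second rate matches the 44.1 kHz capture rate mentioned in this chapter.

```python
import math
import os
import struct
import tempfile
import wave


def save_tone(file_name, seconds=1, frame_rate=44100, frequency=440.0):
    """Write a mono, 16-bit sine tone to a WAV file."""
    frame_count = int(seconds * frame_rate)

    # pack each sample as a little-endian signed 16-bit integer
    frames = b''.join(
        struct.pack('<h', int(16000 * math.sin(
            2 * math.pi * frequency * i / frame_rate)))
        for i in range(frame_count))

    with wave.open(file_name, 'wb') as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(frame_rate)
        wav_file.writeframes(frames)


def describe_wav(file_name):
    """Return (channels, bytes per sample, frame rate, frame count)."""
    with wave.open(file_name, 'rb') as wav_file:
        return (wav_file.getnchannels(), wav_file.getsampwidth(),
                wav_file.getframerate(), wav_file.getnframes())


tone_path = os.path.join(tempfile.gettempdir(), 'tone.wav')
save_tone(tone_path)
print(describe_wav(tone_path))
```

A recorder would follow the same pattern, except that the frames would come from the microphone rather than from a formula.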








Main Program: Function run_translator 

Let’s look at the main part of the program defined in function run_translator (lines 
13-54), which calls the functions defined later in the script. For discussion purposes, 
we broke run_translator into the 10 steps it performs. In Step 1 (lines 15-17), we 
prompt in English for the user to press Enter, then speak a question. Function 
record_audio then records audio for five seconds and stores it in the file 
english.wav: 


13 def run_translator(): 
14     """Calls the functions that interact with Watson services.""" 
15     # Step 1: Prompt for then record English speech into an audio file 
16     input('Press Enter then ask your question in English') 
17     record_audio('english.wav') 
18 








In Step 2, we call function speech_to_text, passing the file english.wav for 
transcription and telling the Speech to Text service to transcribe the text using its 
predefined model 'en-US_BroadbandModel'. We then display the transcribed 
text: 

4 For most languages, the Watson Speech to Text service supports broadband and 
narrowband models. Each has to do with the audio quality. For audio captured at 16 
kHz and higher, IBM recommends using the broadband models. In this app, we 
capture the audio at 44.1 kHz. 

19     # Step 2: Transcribe the English speech to English text 
20     english = speech_to_text( 
21         file_name='english.wav', model_id='en-US_BroadbandModel') 
22     print('English:', english) 
23 


In Step 3, we call function translate, passing the transcribed text from Step 2 as 
the text to translate. Here we tell the Language Translator service to translate the text 
using its predefined model 'en-es' to translate from English (en) to Spanish (es). 
We then display the Spanish translation: 

24     # Step 3: Translate the English text into Spanish text 
25     spanish = translate(text_to_translate=english, model='en-es') 
26     print('Spanish:', spanish) 
27 


In Step 4, we call function text_to_speech, passing the Spanish text from Step 3 
for the Text to Speech service to speak using its voice 'es-US_SofiaVoice'. We also 
specify the file in which the audio should be saved: 

28     # Step 4: Synthesize the Spanish text into Spanish speech 
29     text_to_speech(text_to_speak=spanish, voice_to_use='es-US_SofiaVoice', 
30         file_name='spanish.wav') 
31 





In Step 5, we call function play_audio to play the file 'spanish.wav', which 
contains the Spanish audio for the text we translated in Step 3. 

32     # Step 5: Play the Spanish audio file 
33     play_audio(file_name='spanish.wav') 
34 


Finally, Steps 6-10 repeat what we did in Steps 1-5, but for Spanish speech to 
English speech: 

• Step 6 records the Spanish audio. 

• Step 7 transcribes the Spanish audio to Spanish text using the Speech to Text 
service’s predefined model 'es-ES_BroadbandModel'. 

• Step 8 translates the Spanish text to English text using the Language Translator 
service’s 'es-en' (Spanish-to-English) model. 

• Step 9 creates the English audio using the Text to Speech service’s voice 
'en-US_AllisonVoice'. 

• Step 10 plays the English audio. 


35     # Step 6: Prompt for then record Spanish speech into an audio file 
36     input('Press Enter then speak the Spanish answer') 
37     record_audio('spanishresponse.wav') 
38 
39     # Step 7: Transcribe the Spanish speech to Spanish text 
40     spanish = speech_to_text( 
41         file_name='spanishresponse.wav', model_id='es-ES_BroadbandModel') 
42     print('Spanish response:', spanish) 
43 
44     # Step 8: Translate the Spanish text into English text 
45     english = translate(text_to_translate=spanish, model='es-en') 
46     print('English response:', english) 
47 
48     # Step 9: Synthesize the English text into English speech 
49     text_to_speech(text_to_speak=english, 
50         voice_to_use='en-US_AllisonVoice', 
51         file_name='englishresponse.wav') 
52 
53     # Step 10: Play the English audio 
54     play_audio(file_name='englishresponse.wav') 
55 
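
Because Steps 6-10 mirror Steps 1-5, the two halves could be expressed as one helper that is called twice with different arguments. The following sketch is one possible refactoring, not the script's actual design. The helper translate_speech and the services dictionary are hypothetical. The stub functions below stand in for the script's record_audio, speech_to_text, translate, text_to_speech and play_audio so the example runs without a microphone or Watson credentials.

```python
def translate_speech(prompt, recording_file, stt_model, translation_model,
                     voice, output_file, services):
    """Perform the five steps for one direction of the conversation.

    services is a dictionary of callables (an assumption made here so that
    stubs can replace the real audio and Watson functions).
    """
    services['prompt'](prompt)                  # prompt the speaker
    services['record_audio'](recording_file)    # record speech
    source_text = services['speech_to_text'](   # transcribe speech
        file_name=recording_file, model_id=stt_model)
    target_text = services['translate'](        # translate text
        text_to_translate=source_text, model=translation_model)
    services['text_to_speech'](                 # synthesize speech
        text_to_speak=target_text, voice_to_use=voice,
        file_name=output_file)
    services['play_audio'](file_name=output_file)  # play the result
    return source_text, target_text


# Stubs that record which actions ran instead of using audio or Watson.
calls = []
stub_services = {
    'prompt': lambda message: calls.append('prompt'),
    'record_audio': lambda file_name: calls.append('record'),
    'speech_to_text':
        lambda file_name, model_id: 'where is the closest bathroom',
    'translate':
        lambda text_to_translate, model: '¿Dónde está el baño más cercano?',
    'text_to_speech':
        lambda text_to_speak, voice_to_use, file_name:
            calls.append('synthesize'),
    'play_audio': lambda file_name: calls.append('play'),
}

english, spanish = translate_speech(
    'Press Enter then ask your question in English', 'english.wav',
    'en-US_BroadbandModel', 'en-es', 'es-US_SofiaVoice', 'spanish.wav',
    stub_services)
print(english)
print(spanish)
```

The script in this section keeps the 10 steps written out in sequence, which makes each step easy to follow. A helper like this trades some of that readability for less repetition.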





Now let’s implement the functions we call from Steps 1 through 10. 


Function speech_to_text 

To access Watson’s Speech to Text service, function speech_to_text (lines 56-87) 
creates a SpeechToTextV1 object named stt (short for speech-to-text), passing as 
the argument the API key you set up earlier. The with statement (lines 62-65) opens 
the audio file specified by the file_name parameter and assigns the resulting file 
object to audio_file. The open mode 'rb' indicates that we'll read (r) binary data 
(b); audio files are stored as bytes in binary format. Next, lines 64-65 use the 
SpeechToTextV1 object’s recognize method to invoke the Speech to Text service. 
The method receives three keyword arguments: 

• audio is the file (audio_file) to pass to the Speech to Text service. 

• content_type is the media type of the file’s contents; 'audio/wav' indicates 
that this is an audio file stored in WAV format. 

Media types were formerly known as MIME (Multipurpose Internet Mail 
Extensions) types, a standard that specifies data formats, which programs can use 
to interpret data correctly. 

• model indicates which spoken language model the service will use to recognize the 
speech and transcribe it to text. This app uses predefined models: either 
'en-US_BroadbandModel' (for English) or 'es-ES_BroadbandModel' (for 
Spanish). 


56 def speech_to_text(file_name, model_id): 
57     """Use Watson Speech to Text to convert audio file to text.""" 
58     # create Watson Speech to Text client 
59     stt = SpeechToTextV1(iam_apikey=keys.speech_to_text_key) 
60 
61     # open the audio file 
62     with open(file_name, 'rb') as audio_file: 
63         # pass the file to Watson for transcription 
64         result = stt.recognize(audio=audio_file, 
65             content_type='audio/wav', model=model_id).get_result() 
66 
67     # Get the 'results' list. This may contain intermediate and final 
68     # results, depending on method recognize's arguments. We asked 
69     # for only final results, so this list contains one element. 
70     results_list = result['results'] 
71 
72     # Get the final speech recognition result--the list's only element. 
73     speech_recognition_result = results_list[0] 
74 
75     # Get the 'alternatives' list. This may contain multiple alternative 
76     # transcriptions, depending on method recognize's arguments. We did 
77     # not ask for alternatives, so this list contains one element. 
78     alternatives_list = speech_recognition_result['alternatives'] 
79 
80     # Get the only alternative transcription from alternatives_list. 
81     first_alternative = alternatives_list[0] 
82 
83     # Get the 'transcript' key's value, which contains the audio's 
84     # text transcription. 
85     transcript = first_alternative['transcript'] 
86 
87     return transcript  # return the audio's text transcription 
88 





The recognize method returns a DetailedResponse object. Its get_result 
method returns a JSON document containing the transcribed text, which we store in 
result. The JSON will look similar to the following but depends on the question you 
ask: 

{ 
    "results": [ 
        { 
            "alternatives": [ 
                { 
                    "confidence": 0.983, 
                    "transcript": "where is the closest bathroom " 
                } 
            ], 
            "final": true 
        } 
    ], 
    "result_index": 0 
} 


The JSON contains nested dictionaries and lists. To simplify navigating this data 
structure, lines 70-85 use separate small statements to “pick off” one piece at a time 
until we get the transcribed text—"where is the closest bathroom ", which we 
then return. The boxes around portions of the JSON and the line numbers in each box 


correspond to the statements in lines 70-85. The statements operate as follows: 


• Line 70 assigns to results_list the list associated with the key 'results': 

results_list = result['results'] 

Depending on the arguments you pass to method recognize, this list may contain 
intermediate and final results. Intermediate results might be useful, for example, if 
you were transcribing live audio, such as a newscast. We asked for only final results, 
so this list contains one element. 

For method recognize’s arguments and JSON response details, see 
https://www.ibm.com/watson/developercloud/speech-to- 

• Line 73 assigns to speech_recognition_result the final speech-recognition 
result, which is the only element in results_list: 

speech_recognition_result = results_list[0] 

• Line 78 

alternatives_list = speech_recognition_result['alternatives'] 

assigns to alternatives_list the list associated with the key 
'alternatives'. This list may contain multiple alternative transcriptions, 
depending on method recognize’s arguments. The arguments we passed result in 
a one-element list. 

• Line 81 assigns to first_alternative the only element in 
alternatives_list: 

first_alternative = alternatives_list[0] 

• Line 85 assigns to transcript the 'transcript' key’s value, which contains the 
audio’s text transcription: 

transcript = first_alternative['transcript'] 

• Finally, line 87 returns the audio’s text transcription. 

Lines 70-85 could be replaced with the denser statement 

return result['results'][0]['alternatives'][0]['transcript'] 

but we prefer the separate simpler statements. 
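
The navigation described above can be tried without contacting Watson by applying it to a hand-built dictionary shaped like the response. In this sketch, sample_result is assembled from the keys and values named in this section. The function extract_transcript and its handling of an empty 'results' list are assumptions added for illustration, on the premise that a recording with no recognizable speech might produce no results.

```python
# A dictionary shaped like the Speech to Text response discussed above.
sample_result = {
    'results': [
        {
            'alternatives': [
                {'confidence': 0.983,
                 'transcript': 'where is the closest bathroom '}
            ],
            'final': True
        }
    ],
    'result_index': 0
}


def extract_transcript(result):
    """Return the first alternative's transcript, or None if no results."""
    results_list = result['results']

    if not results_list:  # assumed case: nothing was recognized
        return None

    alternatives_list = results_list[0]['alternatives']
    return alternatives_list[0]['transcript']


# the step-by-step function and the dense expression give the same text
dense = sample_result['results'][0]['alternatives'][0]['transcript']
print(extract_transcript(sample_result) == dense)
print(extract_transcript({'results': [], 'result_index': 0}))
```

Separating the steps leaves a natural place for a check like the empty-list test, which the single dense expression would turn into an IndexError.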


Function translate 


To access the Watson Language Translator service, function translate (lines 89-111) 
first creates a LanguageTranslatorV3 object named language_translator, 
passing as arguments the service version ('2018-05-31') and the API key you set up 
earlier. Lines 97-98 use the LanguageTranslatorV3 object’s 
translate method to invoke the Language Translator service, passing two keyword 
arguments: 

According to the Language Translator service’s API reference, '2018-05-31' is the 
current version string at the time of this writing. IBM changes the version string only if 
they make API changes that are not backward compatible. Even when they do, the 
service will respond to your calls using the API version you specify in the version string. 
For more details, see 
https://www.ibm.com/watson/developercloud/language-translator/api/v3/python.html?python#versioning. 

• text is the string to translate to another language. 

• model_id is the predefined model that the Language Translator service will use to 
understand the original text and translate it into the appropriate language. In this 
app, model will be one of IBM’s predefined translation models: 'en-es' (for 
English to Spanish) or 'es-en' (for Spanish to English). 


89 def translate(text_to_translate, model): 
90     """Use Watson Language Translator to translate English to Spanish 
91        (en-es) or Spanish to English (es-en) as specified by model.""" 
92     # create Watson Translator client 
93     language_translator = LanguageTranslatorV3(version='2018-05-31', 
94         iam_apikey=keys.translate_key) 
95 
96     # perform the translation 
97     translated_text = language_translator.translate( 
98         text=text_to_translate, model_id=model).get_result() 
99 
100     # Get 'translations' list. If method translate's text argument has 
101     # multiple strings, the list will have multiple entries. We passed 
102     # one string, so the list contains only one element. 
103     translations_list = translated_text['translations'] 
104 
105     # get translations_list's only element 
106     first_translation = translations_list[0] 
107 
108     # get 'translation' key's value, which is the translated text 
109     translation = first_translation['translation'] 
110 
111     return translation  # return the translated string 
112 











The method returns a DetailedResponse. That object’s get_result method returns 
a JSON document, like: 

{ 
    "translations": [ 
        { 
            "translation": "¿Dónde está el baño más cercano? " 
        } 
    ] 
} 

The JSON you get as a response depends on the question you asked and, again, 
contains nested dictionaries and lists. Lines 103—109 use small statements to pick off 
the translated text "¿Dónde está el baño más cercano? ". The boxes around 
portions of the JSON and the line numbers in each box correspond to the statements in 


lines 103-109. The statements operate as follows: 


• Line 103 gets the 'translations' list:

    translations_list = translated_text['translations']

  If method translate’s text argument has multiple strings, the list will have
  multiple entries. We passed only one string, so the list contains only one element.

• Line 106 gets translations_list’s only element:

    first_translation = translations_list[0]

• Line 109 gets the 'translation' key’s value, which is the translated text:

    translation = first_translation['translation']

• Line 111 returns the translated string.


Lines 103–109 could be replaced with the more concise statement

    return translated_text['translations'][0]['translation']

but again, we prefer the separate simpler statements.
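
The step-by-step navigation can be tried without calling Watson at all. The following is a minimal sketch that assumes the dictionary returned by get_result has the shape described above; the dictionary here is hand-written for illustration, and a real response may contain additional keys.

```python
# Hand-written stand-in for the dictionary returned by get_result();
# this is an assumption for illustration, not a captured Watson response.
translated_text = {
    'translations': [
        {'translation': '¿Dónde está el baño más cercano?'}
    ]
}

translations_list = translated_text['translations']  # list of dictionaries
first_translation = translations_list[0]  # the list's only element
translation = first_translation['translation']  # the translated string

# the one-statement equivalent
concise = translated_text['translations'][0]['translation']

print(translation)
print(translation == concise)
```

Both forms reach the same string. The separate statements make it easier to inspect each intermediate value when a response does not have the shape you expect.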


Function text_to_speech


To access the Watson Text to Speech service, function text_to_speech (lines 113–122)
creates a TextToSpeechV1 object named tts (short for text-to-speech), passing
as the argument the API key you set up earlier. The with statement opens the file
specified by file_name and associates the file with the name audio_file. The mode
'wb' opens the file for writing (w) in binary (b) format. We'll write into that file the
contents of the audio returned by the Text to Speech service.


113 def text_to_speech(text_to_speak, voice_to_use, file_name):
114     """Use Watson Text to Speech to convert text to specified voice
115        and save to a WAV file."""
116     # create Text to Speech client
117     tts = TextToSpeechV1(iam_apikey=keys.text_to_speech_key)
118
119     # open file and write the synthesized audio content into the file
120     with open(file_name, 'wb') as audio_file:
121         audio_file.write(tts.synthesize(text_to_speak,
122             accept='audio/wav', voice=voice_to_use).get_result().content)
123











Lines 121–122 call two methods. First, we invoke the Text to Speech service by calling
the TextToSpeechV1 object’s synthesize method, passing three arguments:


• text_to_speak is the string to speak.

• the keyword argument accept is the media type indicating the audio format the
Text to Speech service should return; again, 'audio/wav' indicates an audio file
in WAV format.

• the keyword argument voice is one of the Text to Speech service’s predefined
voices. In this app, we'll use 'en-US_AllisonVoice' to speak English text and
'es-US_SofiaVoice' to speak Spanish text. Watson provides many male and
female voices across various languages.⁸

8. For a complete list, see
https://www.ibm.com/watson/developercloud/text-to-speech/api/v1/python.html?python#get-voice.
Try experimenting with other voices.


Watson’s DetailedResponse contains the spoken text audio file, accessible via
get_result. We access the returned file’s content attribute to get the bytes of the
audio and pass them to the audio_file object’s write method to output the bytes to
a .wav file.


Function record_audio


The pyaudio module enables you to record audio from the microphone. The function
record_audio (lines 124–154) defines several constants (lines 126–130) used to
configure the stream of audio information coming from your computer’s microphone.
We used the settings from the pyaudio module’s online documentation:


• FRAME_RATE: 44100 frames per second represents 44.1 kHz, which is common
for CD-quality audio.

• CHUNK: 1024 is the number of frames streamed into the program at a time.

• FORMAT: pyaudio.paInt16 is the size of each sample (in this case, 16-bit or 2-byte
integers).

• CHANNELS: 2 is the number of samples per frame.

• SECONDS: 5 is the number of seconds for which we'll record audio in this app.


124 def record_audio(file_name):
125     """Use pyaudio to record 5 seconds of audio to a WAV file."""
126     FRAME_RATE = 44100  # number of frames per second
127     CHUNK = 1024  # number of frames read at a time
128     FORMAT = pyaudio.paInt16  # each sample is a 16-bit (2-byte) int
129     CHANNELS = 2  # 2 samples per frame
130     SECONDS = 5  # total recording time
131
132     recorder = pyaudio.PyAudio()  # opens/closes audio streams
133
134     # configure and open audio stream for recording (input=True)
135     audio_stream = recorder.open(format=FORMAT, channels=CHANNELS,
136         rate=FRAME_RATE, input=True, frames_per_buffer=CHUNK)
137     audio_frames = []  # stores raw bytes of mic input
138     print('Recording 5 seconds of audio')
139
140     # read 5 seconds of audio in CHUNK-sized pieces
141     for i in range(0, int(FRAME_RATE * SECONDS / CHUNK)):
142         audio_frames.append(audio_stream.read(CHUNK))
143
144     print('Recording complete')
145     audio_stream.stop_stream()  # stop recording
146     audio_stream.close()
147     recorder.terminate()  # release underlying resources used by PyAudio
148
149     # save audio_frames to a WAV file
150     with wave.open(file_name, 'wb') as output_file:
151         output_file.setnchannels(CHANNELS)
152         output_file.setsampwidth(recorder.get_sample_size(FORMAT))
153         output_file.setframerate(FRAME_RATE)
154         output_file.writeframes(b''.join(audio_frames))
155

















Line 132 creates the PyAudio object from which we'll obtain the input stream to record
audio from the microphone. Lines 135–136 use the PyAudio object’s open method to
open the input stream, using the constants FORMAT, CHANNELS, FRAME_RATE and
CHUNK to configure the stream. Setting the input keyword argument to True indicates
that the stream will be used to receive audio input. The open method returns a
pyaudio Stream object for interacting with the stream.


Lines 141–142 use the Stream object’s read method to get 1024 (that is, CHUNK)
frames at a time from the input stream, which we then append to the audio_frames
list. To determine the total number of loop iterations required to produce 5 seconds of
audio using CHUNK frames at a time, we multiply the FRAME_RATE by SECONDS, then
divide the result by CHUNK. Once reading is complete, line 145 calls the Stream object’s
stop_stream method to terminate recording, line 146 calls the Stream object’s
close method to close the Stream, and line 147 calls the PyAudio object’s
terminate method to release the underlying audio resources that were being used to
manage the audio stream.
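
The iteration count is easy to check by hand. This short sketch repeats the arithmetic with the constants from the listing; SAMPLE_WIDTH is a name introduced here for illustration (2 bytes, the size of a 16-bit sample). Because int truncates, the loop reads slightly less than 5 seconds of audio.

```python
FRAME_RATE = 44100  # frames per second
CHUNK = 1024  # frames read per call to read
SECONDS = 5  # requested recording time
CHANNELS = 2  # samples per frame
SAMPLE_WIDTH = 2  # bytes per 16-bit sample (illustrative name)

iterations = int(FRAME_RATE * SECONDS / CHUNK)  # int truncates 215.33...
frames_read = iterations * CHUNK
seconds_recorded = frames_read / FRAME_RATE
bytes_recorded = frames_read * CHANNELS * SAMPLE_WIDTH

print(iterations)  # 215
print(frames_read)  # 220160
print(round(seconds_recorded, 3))  # 4.992
print(bytes_recorded)  # 880640
```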


The with statement in lines 150–154 uses the wave module’s open function to open
the WAV file specified by file_name for writing in binary format ('wb'). Lines 151–
153 configure the WAV file’s number of channels, sample width (obtained from the
PyAudio object’s get_sample_size method) and frame rate. Then line 154 writes
the audio content to the file. The expression b''.join(audio_frames)
concatenates all the frames’ bytes into a byte string. Prepending a string with b
indicates that it’s a string of bytes rather than a string of characters.
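
The file-writing step can be tried without a microphone. The sketch below uses only the standard library and substitutes a synthetic 440 Hz tone for recorded audio; the helper make_chunk, the tone and the file name tone_sketch.wav are all assumptions made for illustration. It hard-codes a sample width of 2 bytes where the listing asks PyAudio for it.

```python
import math
import os
import struct
import tempfile
import wave

FRAME_RATE = 44100
CHANNELS = 2
SAMPLE_WIDTH = 2  # bytes per 16-bit sample
CHUNK = 1024

def make_chunk(start_frame):
    """Return CHUNK frames of a 440 Hz tone as raw little-endian bytes."""
    samples = []
    for n in range(start_frame, start_frame + CHUNK):
        value = int(10000 * math.sin(2 * math.pi * 440 * n / FRAME_RATE))
        samples.extend([value] * CHANNELS)  # same value for both channels
    return struct.pack(f'<{len(samples)}h', *samples)

# stand-in for the list that record_audio fills from the microphone
audio_frames = [make_chunk(i * CHUNK) for i in range(10)]

file_name = os.path.join(tempfile.gettempdir(), 'tone_sketch.wav')

with wave.open(file_name, 'wb') as output_file:
    output_file.setnchannels(CHANNELS)
    output_file.setsampwidth(SAMPLE_WIDTH)
    output_file.setframerate(FRAME_RATE)
    output_file.writeframes(b''.join(audio_frames))  # one bytes object

# read the header back to confirm what was written
with wave.open(file_name, 'rb') as input_file:
    params = input_file.getparams()

print(params.nchannels, params.sampwidth, params.framerate, params.nframes)
```

Ten chunks of 1024 frames give a file of 10240 frames, which the header read back from the file should confirm.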


Function play_audio


To play the audio files returned by Watson’s Text to Speech service, we use features of
the pydub and pydub.playback modules. First, from the pydub module, line 158
uses the AudioSegment class’s from_wav method to load a WAV file. The method
returns a new AudioSegment object representing the audio file. To play the
AudioSegment, line 159 calls the pydub.playback module’s play function,
passing the AudioSegment as an argument.


156 def play_audio(file_name):
157     """Use the pydub module (pip install pydub) to play a WAV file."""
158     sound = pydub.AudioSegment.from_wav(file_name)
159     pydub.playback.play(sound)
160











Executing the run_translator Function 


We call the run_translator function when you execute SimpleLanguageTranslator.py
as a script:

161 if __name__ == '__main__':
162     run_translator()


We hope the divide-and-conquer approach made this substantial case study script
manageable. Many of the steps matched up nicely with some key
Watson services, enabling us to quickly create a powerful mashup application.


13.7 WATSON RESOURCES 


IBM provides a wide range of developer resources to help you familiarize yourself with
their services and begin using them to build applications.


Watson Services Documentation

The Watson Services documentation is at:
https://console.bluemix.net/developer/watson/documentation


For each service, there are documentation and API reference links. Each service’s
documentation typically includes some or all of the following:

• a getting started tutorial.
• a video overview of the service.
• a link to a service demo.
• links to more specific how-to and tutorial documents.
• sample apps.
• additional resources, such as more advanced tutorials, videos, blog posts and more.


Each service’s API reference shows all the details of interacting with the service using
any of several languages, including Python. Click the Python tab to see the
Python-specific documentation and corresponding code samples for the Watson Developer
Cloud Python SDK. The API reference explains all the options for invoking a given
service, the kinds of responses it can return, sample responses, and more.


Watson SDKs

We used the Watson Developer Cloud Python SDK to develop this chapter’s script.
There are SDKs for many other languages and platforms. The complete list is located at:

https://console.bluemix.net/developer/watson/sdks-and-tools


Learning Resources

On the Learning Resources page

https://console.bluemix.net/developer/watson/learning-resources

you'll find links to:

• Blog posts on Watson features and how Watson and AI are being used in industry.
• Watson’s GitHub repository (developer tools, SDKs and sample code).
• The Watson YouTube channel (discussed below).
• Code patterns, which IBM refers to as “roadmaps for solving complex programming
challenges.” Some are implemented in Python, but you may still find the other code
patterns helpful in designing and implementing your Python apps.


Watson Videos

The Watson YouTube channel

https://www.youtube.com/user/IBMWatsonSolutions/

contains hundreds of videos showing you how to use all aspects of Watson. There are
also spotlight videos showing how Watson is being used.


IBM Redbooks

The following IBM Redbooks publications cover IBM Cloud and Watson services in
detail, helping you develop your Watson skills.

• Essentials of Application Development on IBM Cloud:
http://www.redbooks.ibm.com/abstracts/sg248374.html

• Building Cognitive Applications with IBM Watson Services: Volume 1 Getting
Started: http://www.redbooks.ibm.com/abstracts/sg248387.html

• Building Cognitive Applications with IBM Watson Services: Volume 2
Conversation (now called Watson Assistant):
http://www.redbooks.ibm.com/abstracts/sg248394.html

• Building Cognitive Applications with IBM Watson Services: Volume 3 Visual
Recognition:
http://www.redbooks.ibm.com/abstracts/sg248393.html

• Building Cognitive Applications with IBM Watson Services: Volume 4 Natural
Language Classifier:
http://www.redbooks.ibm.com/abstracts/sg248391.html

• Building Cognitive Applications with IBM Watson Services: Volume 5 Language
Translator: http://www.redbooks.ibm.com/abstracts/sg248392.html

• Building Cognitive Applications with IBM Watson Services: Volume 6 Speech to
Text and Text to Speech:
http://www.redbooks.ibm.com/abstracts/sg248388.html

• Building Cognitive Applications with IBM Watson Services: Volume 7 Natural
Language Understanding:
http://www.redbooks.ibm.com/abstracts/sg248398.html


13.8 WRAP-UP 


In this chapter, we introduced IBM’s Watson cognitive-computing platform and 
overviewed its broad range of services. You saw that Watson offers intriguing 
capabilities that you can integrate into your applications. IBM encourages learning and 
experimentation via its free Lite tiers. To take advantage of that, you set up an IBM 
Cloud account. You tried Watson demos to experiment with various services, such as 
natural language translation, speech-to-text, text-to-speech, natural language 
understanding, chatbots, analyzing text for tone and visual object recognition in images
and video.


You installed the Watson Developer Cloud Python SDK for programmatic access to 
Watson services from your Python code. In the traveler’s companion translation app, 
we mashed up several Watson services to enable English-only and Spanish-only 
speakers to communicate easily with one another verbally. We transcribed English and 
Spanish audio recordings to text, translated the text to the other language, then 
synthesized English and Spanish audio from the translated text. Finally, we discussed 
various Watson resources, including documentation, blogs, the Watson GitHub 
repository, the Watson YouTube channel, code patterns implemented in Python (and 
other languages) and IBM Redbooks. 


14. Machine Learning: Classification, Regression 
and Clustering 


Objectives 

In this chapter you'll:

• Use scikit-learn with popular datasets to perform machine learning studies.
• Use Seaborn and Matplotlib to visualize and explore data.
• Perform supervised machine learning with k-nearest neighbors classification and
linear regression.
• Perform multi-classification with the Digits dataset.
• Divide a dataset into training, test and validation sets.
• Tune model hyperparameters with k-fold cross-validation.
• Measure model performance.
• Display a confusion matrix showing classification prediction hits and misses.
• Perform multiple linear regression with the California Housing dataset.
• Perform dimensionality reduction with PCA and t-SNE on the Iris and Digits datasets
to prepare them for two-dimensional visualizations.
• Perform unsupervised machine learning with k-means clustering and the Iris dataset.


Outline 


14.1 Introduction to Machine Learning
14.1.1 Scikit-Learn
14.1.2 Types of Machine Learning
14.1.3 Datasets Bundled with Scikit-Learn
14.1.4 Steps in a Typical Data Science Study
14.2 Case Study: Classification with k-Nearest Neighbors and the Digits Dataset, Part 1
14.2.1 k-Nearest Neighbors Algorithm
14.2.2 Loading the Dataset
14.2.3 Visualizing the Data
14.2.4 Splitting the Data for Training and Testing
14.2.5 Creating the Model
14.2.6 Training the Model
14.2.7 Predicting Digit Classes
14.3 Case Study: Classification with k-Nearest Neighbors and the Digits Dataset, Part 2
14.3.1 Metrics for Model Accuracy
14.3.2 K-Fold Cross-Validation
14.3.3 Running Multiple Models to Find the Best One
14.3.4 Hyperparameter Tuning
14.4 Case Study: Time Series and Simple Linear Regression
14.5 Case Study: Multiple Linear Regression with the California Housing Dataset
14.5.1 Loading the Dataset
14.5.2 Exploring the Data with Pandas
14.5.3 Visualizing the Features
14.5.4 Splitting the Data for Training and Testing
14.5.5 Training the Model
14.5.6 Testing the Model
14.5.7 Visualizing the Expected vs. Predicted Prices
14.5.8 Regression Model Metrics
14.5.9 Choosing the Best Model
14.6 Case Study: Unsupervised Machine Learning, Part 1: Dimensionality Reduction
14.7 Case Study: Unsupervised Machine Learning, Part 2: k-Means Clustering
14.7.1 Loading the Iris Dataset
14.7.2 Exploring the Iris Dataset: Descriptive Statistics with Pandas
14.7.3 Visualizing the Dataset with a Seaborn pairplot
14.7.4 Using a KMeans Estimator
14.7.5 Dimensionality Reduction with Principal Component Analysis
14.7.6 Choosing the Best Clustering Estimator
14.8 Wrap-Up


14.1 INTRODUCTION TO MACHINE LEARNING 


In this chapter and the next, we'll present machine learning—one of the most exciting 
and promising subfields of artificial intelligence. You'll see how to quickly solve 
challenging and intriguing problems that novices and most experienced programmers 
probably would not have attempted just a few years ago. Machine learning is a big, 
complex topic that raises lots of subtle issues. Our goal here is to give you a friendly,
hands-on introduction to a few of the simpler machine-learning techniques.


What Is Machine Learning? 


Can we really make our machines (that is, our computers) learn? In this and the next
chapter, we’ll show exactly how that magic happens. What’s the “secret sauce” of this
new application-development style? It’s data and lots of it. Rather than programming
expertise into our applications, we program them to learn from data. We'll present
many Python-based code examples that build working machine-learning models then
use them to make remarkably accurate predictions.


Prediction 


Wouldn’t it be fantastic if you could improve weather forecasting to save lives, 
minimize injuries and property damage? What if we could improve cancer diagnoses 
and treatment regimens to save lives, or improve business forecasts to maximize profits 
and secure people’s jobs? What about detecting fraudulent credit-card purchases and 
insurance claims? How about predicting customer “churn,” what prices houses are 
likely to sell for, ticket sales of new movies, and anticipated revenue of new products 
and services? How about predicting the best strategies for coaches and players to use to 
win more games and championships? All of these kinds of predictions are happening
today with machine learning.


Machine Learning Applications 


Here’s a table of some popular machine-learning applications:

Machine learning applications

Anomaly detection
Chatbots
Classifying emails as spam or not spam
Classifying news articles as sports, financial, politics, etc.
Computer vision and image classification
Credit-card fraud detection
Customer churn prediction
Data compression
Data exploration
Data mining social media (like Facebook, Twitter, LinkedIn)
Detecting objects in scenes
Detecting patterns in data
Diagnostic medicine
Facial recognition
Insurance fraud detection
Intrusion detection in computer networks
Handwriting recognition
Marketing: Divide customers into clusters
Natural language translation (English to Spanish, French to Japanese, etc.)
Predict mortgage loan defaults
Recommender systems (“people who bought this product also bought…”)
Self-Driving cars (more generally, autonomous vehicles)
Sentiment analysis (like classifying movie reviews as positive, negative or neutral)
Spam filtering
Time series predictions like stock-price forecasting and weather forecasting
Voice recognition


14.1.1 Scikit-Learn 


We'll use the popular scikit-learn machine learning library. Scikit-learn, also called 
sklearn, conveniently packages the most effective machine-learning algorithms as 
estimators. Each is encapsulated, so you don’t see the intricate details and heavy 
mathematics of how these algorithms work. You should feel comfortable with this—you 
drive your car without knowing the intricate details of how engines, transmissions, 
braking systems and steering systems work. Think about this the next time you step 
into an elevator and select your destination floor, or turn on your television and select 
the program you’d like to watch. Do you really understand the internal workings of your
smartphone’s hardware and software?


With scikit-learn and a small amount of Python code, you'll create powerful models 
quickly for analyzing data, extracting insights from the data and most importantly 
making predictions. You'll use scikit-learn to train each model on a subset of your data, 
then test each model on the rest to see how well your model works. Once your models 
are trained, you'll put them to work making predictions based on data they have not 
seen. You'll often be amazed at the results. All of a sudden your computer that you’ve
used mostly on rote chores will take on characteristics of intelligence.


Scikit-learn has tools that automate training and testing your models. Although you can
specify parameters to customize the models and possibly improve their performance, in
this chapter, we'll typically use the models’ default parameters, yet still obtain
impressive results. There also are tools like auto-sklearn
(https://automl.github.io/auto-sklearn), which automates many of the
tasks you perform with scikit-learn.


Which Scikit-Learn Estimator Should You Choose for Your Project?

It’s difficult to know in advance which model(s) will perform best on your data, so you
typically try many models and pick the one(s) that perform best. As you'll see, scikit-learn
makes this convenient for you. How do we evaluate which model performed best?


You'll want to experiment with lots of different models on different kinds of datasets. 
You'll rarely get to know the details of the complex mathematical algorithms in the 
sklearn estimators, but with experience, you'll become familiar with which algorithms 
may be best for particular types of datasets and problems. Even with that experience, 
it’s unlikely that you'll be able to intuit the best model for each new dataset. So scikit- 
learn makes it easy for you to “try ’em all.” It takes at most a few lines of code for you to 
create and use each model. The models report their performance so you can compare 
the results and pick the model(s) with the best performance. 
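
The “try ’em all” idea can be sketched as a loop over several estimators. The following is a minimal illustration rather than a recommended benchmark: the three estimators, the 10-fold split and random_state=11 are assumptions chosen for this example, and it uses the Digits dataset bundled with scikit-learn.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

digits = load_digits()

# estimators to compare, each with its default settings
estimators = {
    'KNeighborsClassifier': KNeighborsClassifier(),
    'SVC': SVC(),
    'GaussianNB': GaussianNB(),
}

# the same folds are used for every estimator so the scores are comparable
kfold = KFold(n_splits=10, random_state=11, shuffle=True)

mean_scores = {}
for name, estimator in estimators.items():
    scores = cross_val_score(estimator, X=digits.data, y=digits.target,
                             cv=kfold)
    mean_scores[name] = scores.mean()
    print(f'{name:>20}: mean accuracy={scores.mean():.2%}')

best = max(mean_scores, key=mean_scores.get)
print(f'Best mean accuracy: {best}')
```

Each estimator needs only one line to create and one call to evaluate, which is what makes comparing many models practical.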


14.1.2 Types of Machine Learning 


We'll present the two main types of machine learning—supervised machine learning, 
which works with labeled data, and unsupervised machine learning, which works with 
unlabeled data. 


If, for example, you’re developing a computer vision application to recognize dogs and
cats, you'll train your model on lots of dog photos labeled “dog” and cat photos labeled
“cat.” If your model is effective, when you put it to work processing unlabeled photos it
will recognize dogs and cats it has never seen before. The more photos you train with,
the greater the chance that your model will accurately predict which new photos are
dogs and which are cats. In this era of big data and massive, economical computer
power, you should be able to build some pretty accurate models with the techniques
you’re about to see.


How can looking at unlabeled data be useful? Online booksellers sell lots of books. They
record enormous amounts of (unlabeled) book purchase transaction data. They noticed
early on that people who bought certain books were likely to purchase other books on
the same or similar topics. That led to their recommendation systems. Now, when you
browse a bookseller site for a particular book, you’re likely to see recommendations
like, “people who bought this book also bought these other books.” Recommendation
systems are big business today, helping to maximize product sales of all kinds.


Supervised Machine Learning 


Supervised machine learning falls into two categories—classification and regression. 
You train machine-learning models on datasets that consist of rows and columns. Each 
row represents a data sample. Each column represents a feature of that sample. In 
supervised machine learning, each sample has an associated label called a target (like
“dog” or “cat”). This is the value you’re trying to predict for new data that you present to
your models.
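
To make the terms sample, feature and target concrete, here is a small sketch that inspects one of scikit-learn's bundled datasets. The Iris dataset is used only because it is small; any bundled dataset with targets would serve equally well.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# data is a two-dimensional array: one row per sample, one column per feature
print(iris.data.shape)

# one name per column (feature)
print(iris.feature_names)

# the first sample's feature values and its target (label)
first_target = iris.target[0]
print(iris.data[0])
print(first_target, iris.target_names[first_target])
```

The target array has one entry per row of data, so the two stay aligned by position.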


Datasets 


You'll work with some “toy” datasets, each with a small number of samples with a 
limited number of features. You'll also work with several richly featured real-world 
datasets, one containing a few thousand samples and one containing tens of thousands 
of samples. In the world of big data, datasets commonly have millions or billions of
samples, or even more.


There’s an enormous number of free and open datasets available for data science
studies. Libraries like scikit-learn package up popular datasets for you to experiment
with and provide mechanisms for loading datasets from various repositories (such as
openml.org). Governments, businesses and other organizations worldwide offer
datasets on a vast range of subjects. You'll work with several popular free datasets,
using a variety of machine learning techniques.


Classification 


We'll use one of the simplest classification algorithms, k-nearest neighbors, to analyze
the Digits dataset bundled with scikit-learn. Classification algorithms predict the
discrete classes (categories) to which samples belong. Binary classification uses two
classes, such as “spam” or “not spam” in an email classification application.
Multi-classification uses more than two classes, such as the 10 classes, 0 through 9, in the
Digits dataset. A classification scheme looking at movie descriptions might try to
classify them as “action,” “adventure,” “fantasy,” “romance,” “history” and the like.
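
To show the idea behind k-nearest neighbors before using scikit-learn's estimator, here is a minimal pure-Python sketch. It is not scikit-learn's implementation; the predict function, the two-feature samples and their labels are all made up for illustration.

```python
from collections import Counter
import math

def predict(training_samples, training_labels, new_sample, k=3):
    """Return the most common label among the k nearest training samples."""
    # pair each training sample's distance to new_sample with its label
    distances = sorted(
        (math.dist(sample, new_sample), label)
        for sample, label in zip(training_samples, training_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# hypothetical two-feature samples with made-up labels
samples = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
           (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ['cat', 'cat', 'cat', 'dog', 'dog', 'dog']

print(predict(samples, labels, (1.1, 1.0)))
print(predict(samples, labels, (4.9, 5.0)))
```

The new sample takes the class that wins the vote among its k closest neighbors, so the first call lands among the “cat” samples and the second among the “dog” samples.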


Regression 


Regression models predict a continuous output, such as the predicted temperature
output in the weather time series analysis from Chapter 10’s Intro to Data Science
section. In this chapter, we'll revisit that simple linear regression example, this time
implementing it using scikit-learn’s LinearRegression estimator. Next, we use a
LinearRegression estimator to perform multiple linear regression with the
California Housing dataset that’s bundled with scikit-learn. We'll predict the median
house value of a U.S. census block of homes, considering eight features per block, such
as the average number of rooms, median house age, average number of bedrooms and
median income. The LinearRegression estimator, by default, uses all the numerical
features in a dataset to make more sophisticated predictions than you can with a
single-feature simple linear regression.
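
The following sketch shows a LinearRegression estimator learning from more than one feature. To keep the result checkable, it uses synthetic data rather than the California Housing dataset: the two features and the formula y = 3*x1 + 2*x2 + 1 are assumptions invented for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(11)
X = rng.uniform(0, 10, size=(100, 2))  # 100 samples, 2 features
y = 3 * X[:, 0] + 2 * X[:, 1] + 1  # targets built from a known formula

model = LinearRegression()
model.fit(X, y)  # learns one coefficient per feature plus an intercept

print(model.coef_)  # approximately [3. 2.]
print(model.intercept_)  # approximately 1.0

prediction = model.predict([[1.0, 1.0]])  # approximately 3 + 2 + 1 = 6
print(prediction)
```

Because the targets contain no noise, the estimator recovers the formula's coefficients almost exactly. Real datasets are noisier, so the fit is only approximate.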


Unsupervised Machine Learning 


Next, we'll introduce unsupervised machine learning with clustering algorithms. We'll
use dimensionality reduction (with scikit-learn’s TSNE estimator) to compress the
Digits dataset’s 64 features down to two for visualization purposes. This will enable us
to see how nicely the Digits data “cluster up.” This dataset contains handwritten digits
like those the post office’s computers must recognize to route each letter to its
designated zip code. This is a challenging computer-vision problem, given that each
person’s handwriting is unique. Yet, we'll build this clustering model with just a few
lines of code and achieve impressive results. And we'll do this without having to
understand the inner workings of the clustering algorithm. This is the beauty of
object-based programming. We'll see this kind of convenient object-based programming again
in the next chapter, where we'll build powerful deep learning models using the open
source Keras library.


K-Means Clustering and the Iris Dataset 


We'll present the simplest unsupervised machine-learning algorithm, k-means 
clustering, and use it on the Iris dataset that’s also bundled with scikit-learn. We'll use 
dimensionality reduction (with scikit-learn’s PCA estimator) to compress the Iris 
dataset’s four features to two for visualization purposes. We'll show the clustering of the 
three Iris species in the dataset and graph each cluster’s centroid, which is the cluster’s 
center point. Finally, we'll run multiple clustering estimators to compare their ability to
divide the Iris dataset’s samples effectively into three clusters.
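
As a brief sketch of the dimensionality reduction step, the following code uses scikit-learn's PCA estimator to compress the Iris dataset's four features to two. The random_state value is an arbitrary choice made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

pca = PCA(n_components=2, random_state=11)  # keep two components
reduced = pca.fit_transform(iris.data)

print(iris.data.shape)  # (150, 4)
print(reduced.shape)  # (150, 2)

# fraction of the original variance retained by the two components
retained = pca.explained_variance_ratio_.sum()
print(f'{retained:.2%}')
```

Each sample keeps its row position, so the two-column result can be plotted directly and colored by species or by cluster.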


You normally specify the desired number of clusters, k. K-means works through the 
data trying to divide it into that many clusters. As with many machine learning 
algorithms, k-means is iterative and gradually zeros in on the clusters to match the
number you specify.
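
The iterative behavior can be sketched in a few lines of pure Python. This is a simplified illustration, not scikit-learn's KMeans implementation: the k_means function, the sample points and the starting centroids are all assumptions made for this example, and k is simply the number of starting centroids supplied.

```python
import math

def k_means(points, centroids, max_iterations=100):
    """Cluster points around the given starting centroids."""
    for _ in range(max_iterations):
        # assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)

        # update step: move each centroid to the mean of its cluster
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid

        if new_centroids == centroids:  # nothing moved, so we are done
            break
        centroids = new_centroids

    return centroids, clusters

# hypothetical two-feature samples forming two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 2.0),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

final_centroids, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)
print(clusters)
```

Each pass reassigns the samples and then recomputes the centroids. The loop stops once a pass leaves every centroid where it was.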


K-means clustering can find similarities in unlabeled data. This can ultimately help
with assigning labels to that data so that supervised learning estimators can then
process it. Given that it’s tedious and error-prone for humans to have to assign labels to
unlabeled data, and given that the vast majority of the world’s data is unlabeled,
unsupervised machine learning is an important tool.


Big Data and Big Computer Processing Power 


The amount of data that’s available today is already enormous and continues to grow
exponentially. The data produced in the world in the last few years equals the amount
produced up to that point since the dawn of civilization. We commonly talk about big
data, but “big” may not be a strong enough term to describe truly how huge data is
getting.


People used to say “I’m drowning in data and I don’t know what to do with it.” With 
machine learning, we now say, “Flood me with big data so I can use machine-learning
technology to extract insights and make predictions from it.”


This is occurring at a time when computing power is exploding and computer memory 
and secondary storage are exploding in capacity while costs dramatically decline. All of 
this enables us to think differently about the solution approaches. We now can program
computers to learn from data, and lots of it. It’s now all about predicting from data.


14.1.3 Datasets Bundled with Scikit-Learn 


The following table lists scikit-learn’s bundled datasets.¹ Scikit-learn also provides
capabilities for loading datasets from other sources, such as the 20,000+ datasets
available at openml.org.

1. http://scikit-learn.org/stable/datasets/index.html.


Datasets bundled with scikit-learn

“Toy” datasets                               Real-world datasets
Boston house prices                          Olivetti faces
Iris plants                                  20 newsgroups text
Diabetes                                     Labeled Faces in the Wild face recognition
Optical recognition of handwritten digits    Forest cover types
Linnerrud                                    RCV1
Wine recognition                             Kddcup 99
Breast cancer Wisconsin (diagnostic)         California Housing


14.1.4 Steps in a Typical Data Science Study 


We'll perform the steps of a typical machine-learning case study, including:

• loading the dataset
• exploring the data with pandas and visualizations
• transforming your data (converting non-numeric data to numeric data because
scikit-learn requires numeric data; in this chapter, we use datasets that are “ready to
go,” but we'll discuss the issue again in the “Deep Learning” chapter)
• splitting the data for training and testing
• creating the model
• training and testing the model
• tuning the model and evaluating its accuracy
• making predictions on live data that the model hasn’t seen before.
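
Several of these steps can be sketched in a few lines with scikit-learn. The example below is a minimal illustration using the bundled Digits dataset and a k-nearest neighbors classifier with default settings; random_state=11 is an arbitrary choice that makes the split reproducible, and the exploring, transforming and tuning steps are omitted.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()  # load the dataset

# split the data for training and testing (75% and 25% by default)
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=11)

knn = KNeighborsClassifier()  # create the model
knn.fit(X=X_train, y=y_train)  # train the model

predicted = knn.predict(X=X_test)  # predict classes for unseen samples
accuracy = knn.score(X_test, y_test)  # fraction predicted correctly

print(X_train.shape, X_test.shape)
print(f'{accuracy:.2%}')
```

The test samples play the role of data the model has not seen, so the reported accuracy estimates how the model might perform on new data.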


In the “Array-Oriented Programming with NumPy” and “Strings: A Deeper Look” 
chapters’ Intro to Data Science sections, we discussed using pandas to deal with 
missing and erroneous values. These are important steps in cleaning your data before
using it for machine learning.


14.2 CASE STUDY: CLASSIFICATION WITH K-NEAREST 
NEIGHBORS AND THE DIGITS DATASET, PART 1 


To process mail efficiently and route each letter to the correct destination, postal
service computers must be able to scan handwritten names, addresses and zip codes
and recognize the letters and digits. As you'll see in this chapter, powerful libraries like
scikit-learn enable even novice programmers to make such machine-learning problems
manageable. In the next chapter, we'll use even more powerful computer-vision
capabilities when we present the deep learning technology of convolutional neural
networks.


Classification Problems 


In this section, we'll look at classification in supervised machine learning, which
attempts to predict the distinct class² to which a sample belongs. For example, if you
have images of dogs and images of cats, you can classify each image as a “dog” or a
“cat.” This is a binary classification problem because there are two classes.

2. Note that the term class in this case means category, not the Python concept of a
class.


We'll use the Digits dataset³ bundled with scikit-learn, which consists of 8-by-8
pixel images representing 1797 hand-written digits (0 through 9). Our goal is to predict
which digit an image represents. Since there are 10 possible digits (the classes), this is a
multi-classification problem. You train a classification model using labeled data:
we know in advance each digit’s class. In this case study, we'll use one of the simplest
machine-learning classification algorithms, k-nearest neighbors (k-NN), to recognize
handwritten digits.

³ http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset.


The following low-resolution digit visualization of a 5 was produced with Matplotlib 
from one digit’s 8-by-8 pixel raw data. We'll show how to display images like this with 
Matplotlib momentarily: 





Researchers created the images in this dataset from the MNIST database’s tens of 
thousands of 32-by-32 pixel images that were produced in the early 1990s. At today’s 
high-definition camera and scanner resolutions, such images can be captured with 


much higher resolutions. 


Our Approach 


We'll cover this case study over two sections. In this section, we'll begin with the basic
steps of a machine learning case study:

• Choose the data on which to train a model.
• Load and explore the data.
• Split the data for training and testing.
• Select and build the model.
• Train the model.
• Make predictions.

As you'll see, in scikit-learn each of these steps requires at most a few lines of code. In
the next section, we'll

• Evaluate the results.
• Tune the model.
• Run several classification models to choose the best one(s).


We'll visualize the data using Matplotlib and Seaborn, so launch IPython with
Matplotlib support:


ipython --matplotlib 


14.2.1 k-Nearest Neighbors Algorithm 


Scikit-learn supports many classification algorithms, including the simplest, k-nearest
neighbors (k-NN). This algorithm attempts to predict a test sample’s class
by looking at the k training samples that are nearest (in distance) to the test sample.
For example, consider the following diagram in which the filled dots represent four
sample classes: A, B, C and D. For this discussion, we'll use these letters as the class
names:








[Diagram: samples of classes A, B, C and D, with new samples X, Y and Z]

We want to predict the classes to which the new samples X, Y and Z belong. Let’s
assume we'd like to make these predictions using each sample’s three nearest neighbors
(three is k in the k-nearest neighbors algorithm):

• Sample X’s three nearest neighbors are all class D dots, so we’d predict that X’s
class is D.

• Sample Y’s three nearest neighbors are all class B dots, so we’d predict that Y’s class
is B.

• For Z, the choice is not as clear, because it appears between the B and C dots. Of the
three nearest neighbors, one is class B and two are class C. In the k-nearest
neighbors algorithm, the class with the most “votes” wins. So, based on two C votes
to one B vote, we’d predict that Z’s class is C. Picking an odd k value in the k-NN
algorithm avoids ties when there are only two classes. With more than two classes,
ties are still possible (for example, three neighbors from three different classes).
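The voting procedure described above can be sketched in a few lines of plain Python. This is only an illustration, not scikit-learn's implementation: the function name knn_predict, the sample coordinates and the use of Euclidean distance are assumptions made for the example.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)


def knn_predict(train_points, train_labels, query, k=3):
    """Predict query's class by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label, nearest first.
    by_distance = sorted(
        (dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    # most_common(1) returns the label with the most votes. How a tie is
    # broken here is an arbitrary property of this sketch.
    return votes.most_common(1)[0][0]


# Hypothetical 2-D samples: two class B points and three class C points.
points = [(1.0, 1.0), (1.5, 2.0), (4.0, 4.0), (4.5, 3.5), (5.0, 4.5)]
labels = ['B', 'B', 'C', 'C', 'C']

z = (3.2, 3.0)  # a new sample lying between the two groups
print(knn_predict(points, labels, z, k=3))
```

With these made-up coordinates, the three nearest neighbors of z are two C points and one B point, mirroring the situation described for sample Z.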


Hyperparameters and Hyperparameter Tuning 


In machine learning, a model implements a machine-learning algorithm. In scikit-learn,
models are called estimators. There are two parameter types in machine
learning:

• those the estimator calculates as it learns from the data you provide and
• those you specify in advance when you create the scikit-learn estimator object that
represents the model.

The parameters specified in advance are called hyperparameters.

In the k-nearest neighbors algorithm, k is a hyperparameter. For simplicity, we'll use
scikit-learn’s default hyperparameter values. In real-world machine-learning studies,
you'll want to experiment with different values of k to produce the best possible models
for your studies. This process is called hyperparameter tuning. Later we'll use
hyperparameter tuning to choose the value of k that enables the k-nearest neighbors
algorithm to make the best predictions for the Digits dataset. Scikit-learn also has
automated hyperparameter tuning capabilities.


14.2.2 Loading the Dataset 


The load_digits function from the sklearn.datasets module returns a scikit-learn
Bunch object containing the digits data and information about the Digits dataset
(called metadata):

In [1]: from sklearn.datasets import load_digits

In [2]: digits = load_digits()

Bunch is a subclass of dict that has additional attributes for interacting with the
dataset.


Displaying the Description 


The Digits dataset bundled with scikit-learn is a subset of the UCI (University of
California Irvine) ML hand-written digits dataset at:

http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The original UCI dataset contains 5620 samples: 3823 for training and 1797 for
testing. The version of the dataset bundled with scikit-learn contains only the 1797
testing samples. A Bunch’s DESCR attribute contains a description of the dataset.


According to the Digits dataset’s description⁴, each sample has 64 features (as
specified by Number of Attributes) that represent an 8-by-8 image with pixel
values in the range 0–16 (specified by Attribute Information). This dataset has
no missing values (as specified by Missing Attribute Values). The 64 features
may seem like a lot, but real-world datasets can sometimes have hundreds, thousands
or even millions of features.

⁴ We highlighted some key information in bold.


In [3]: print(digits.DESCR)
.. _digits_dataset:

Optical recognition of handwritten digits dataset

**Data Set Characteristics:**

    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits dataset
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

Checking the Sample and Target Sizes


The Bunch object’s data and target attributes are NumPy arrays: 


• The data array contains the 1797 samples (the digit images), each with 64 features,
having values in the range 0–16, representing pixel intensities. With Matplotlib,
we'll visualize these intensities in grayscale shades from white (0) to black (16).

• The target array contains the images’ labels, that is, the classes indicating which
digit each image represents. The array is called target because, when you make
predictions, you’re aiming to “hit the target” values. To see labels of samples
throughout the dataset, let’s display the target values of every 100th sample:


In [4]: digits.target[::100]
Out[4]: array([0, 4, 1, 7, 4, 8, 2, 2, 4, 4, 1, 9, 7, 3, 2, 1, 2, 5])











We can confirm the number of samples and features (per sample) by looking at the
data array’s shape attribute, which shows that there are 1797 rows (samples) and 64
columns (features):

In [5]: digits.data.shape
Out[5]: (1797, 64)

You can confirm that the number of target values matches the number of samples by
looking at the target array’s shape:

In [6]: digits.target.shape
Out[6]: (1797,)


A Sample Digit Image 


Each image is two-dimensional: it has a width and a height in pixels. The Bunch object
returned by load_digits contains an images attribute, an array in which each
element is a two-dimensional 8-by-8 array representing a digit image’s pixel intensities.
Though the original dataset represents each pixel as an integer value from 0–16,
scikit-learn stores these values as floating-point values (NumPy type float64). For example,
here’s the two-dimensional array representing the sample image at index 13:

In [7]: digits.images[13]

and here’s the image represented by this two-dimensional array (we'll soon show the
code for displaying this image):





Preparing the Data for Use with Scikit-Learn 


Scikit-learn’s machine-learning algorithms require samples to be stored in a
two-dimensional array of floating-point values (or two-dimensional array-like collection,
such as a list of lists or a pandas DataFrame):

• Each row represents one sample.
• Each column in a given row represents one feature for that sample.

To represent every sample as one row, multi-dimensional data like the two-dimensional
image array shown in snippet [7] must be flattened into a one-dimensional array.

If you were working with data containing categorical features (typically
represented as strings, such as 'spam' or 'not-spam'), you’d also have to preprocess
those features into numerical values. One common technique is one-hot encoding, which
we cover in the next chapter. Scikit-learn’s sklearn.preprocessing module provides
capabilities for converting categorical data to numeric data. The Digits dataset has no
categorical features.


For your convenience, the load_digits function returns the preprocessed data ready
for machine learning. The Digits dataset is numerical, so load_digits simply flattens
each image’s two-dimensional array into a one-dimensional array. For example, the
8-by-8 array digits.images[13] shown in snippet [7] corresponds to the 1-by-64
array digits.data[13] shown below:

In [8]: digits.data[13]

In this one-dimensional array, the first eight elements are the two-dimensional array’s
row 0, the next eight elements are the two-dimensional array’s row 1, and so on.
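The row-by-row flattening just described can be illustrated with plain Python lists. This is a sketch only: the image below is hypothetical (each pixel holds its own expected position in the flattened result), and it does not use load_digits or NumPy.

```python
# A hypothetical 8-by-8 image whose pixel at (row, col) holds row * 8 + col,
# so each value doubles as its expected index in the flattened result.
image = [[row * 8 + col for col in range(8)] for row in range(8)]

# Row-major flattening: all of row 0, then all of row 1, and so on.
flat = [pixel for row in image for pixel in row]

print(len(flat))    # number of features per sample
print(flat[:8])     # the elements that came from row 0
print(flat[8:16])   # the elements that came from row 1
```

In general, the pixel at (row, col) of an 8-by-8 image lands at index row * 8 + col of the flattened array.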


14.2.3 Visualizing the Data 


You should always familiarize yourself with your data. This process is called data
exploration. For the digit images, you can get a sense of what they look like by
displaying them with the Matplotlib imshow function. The following image shows the
dataset’s first 24 images. To see how difficult a problem handwritten digit recognition
is, consider the variations among the images of the 3s in the first, third and fourth
rows, and look at the images of the 2s in the first, third and fourth rows.

Creating the Diagram


Let’s look at the code that displayed these 24 digits. The following call to function
subplots creates a 6-by-4 inch Figure (specified by the figsize=(6, 4) keyword
argument) containing 24 subplots arranged in 4 rows (nrows=4) and 6 columns
(ncols=6). Each subplot has its own Axes object, which we'll use to display one digit
image:

In [9]: import matplotlib.pyplot as plt

In [10]: figure, axes = plt.subplots(nrows=4, ncols=6, figsize=(6, 4))

Function subplots returns the Axes objects in a two-dimensional NumPy array.
Initially, the Figure appears as shown below with labels (which we'll remove) on every
subplot’s x- and y-axes:


Displaying Each Image and Removing the Axes Labels 





Next, use a for statement with the built-in zip function to iterate in parallel through
the 24 Axes objects, the first 24 images in digits.images and the first 24 values in
digits.target:

In [11]: for item in zip(axes.ravel(), digits.images, digits.target):
    ...:     axes, image, target = item
    ...:     axes.imshow(image, cmap=plt.cm.gray_r)
    ...:     axes.set_xticks([])  # remove x-axis tick marks
    ...:     axes.set_yticks([])  # remove y-axis tick marks
    ...:     axes.set_title(target)
    ...: plt.tight_layout()


Recall that NumPy array method ravel creates a one-dimensional view of a
multidimensional array. Also, recall that zip produces tuples containing elements from
the same index in each of zip’s arguments and that the argument with the fewest
elements determines how many tuples zip returns.
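The way zip stops at its shortest argument can be seen with small stand-in lists. The names below are placeholders for illustration only; they are not Matplotlib Axes objects or real digit images.

```python
axes_names = ['ax0', 'ax1', 'ax2']                   # stand-ins for three Axes objects
images = ['img0', 'img1', 'img2', 'img3', 'img4']    # stand-ins for five images
targets = [0, 1, 2, 3, 4]                            # stand-ins for five labels

# zip pairs up elements by index and stops when the shortest argument runs out,
# so only three tuples are produced even though there are five images.
triples = list(zip(axes_names, images, targets))
print(triples)
```

This is why the loop in snippet [11] displays only the first 24 images: there are 24 Axes objects but 1797 images and targets.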


Each iteration of the loop:

• Unpacks one tuple from the zipped items into three variables representing the
Axes object, image and target value.

• Calls the Axes object’s imshow method to display one image. The keyword
argument cmap=plt.cm.gray_r determines the colors displayed in the image.
The value plt.cm.gray_r is a color map, a group of colors that are typically
chosen to work well together. This particular color map enables us to display the
image’s pixels in grayscale, with 0 as white, 16 as black and the values in between as
gradually darkening shades of gray. For Matplotlib’s color map names see
https://matplotlib.org/examples/color/colormaps_reference.html.
Each can be accessed through the plt.cm object or via a string, like 'gray_r'.

• Calls the Axes object’s set_xticks and set_yticks methods with empty lists to
indicate that the x- and y-axes should not have tick marks.

• Calls the Axes object’s set_title method to display the target value above the
image. This shows the actual value that the image represents.

After the loop, we call tight_layout to remove the extra whitespace at the Figure’s
top, right, bottom and left, so the rows and columns of digit images can fill more of the
Figure.


14.2.4 Splitting the Data for Training and Testing 


You typically train a machine-learning model with a subset of a dataset. Typically, the 
more data you have for training, the better you can train the model. It’s important to set 
aside a portion of your data for testing, so you can evaluate a model’s performance 
using data that the model has not yet seen. Once you're confident that the model is 


performing well, you can use it to make predictions using new data it hasn’t seen. 


We first break the data into a training set and a testing set to prepare to train and
test the model. The function train_test_split from the
sklearn.model_selection module shuffles the data to randomize it, then splits the
samples in the data array and the target values in the target array into training and
testing sets. This helps ensure that the training and testing sets have similar
characteristics. The shuffling and splitting is performed conveniently for you by a
ShuffleSplit object from the sklearn.model_selection module. Function
train_test_split returns a tuple of four elements in which the first two are the
samples split into training and testing sets, and the last two are the corresponding
target values split into training and testing sets. By convention, uppercase X is used to
represent the samples, and lowercase y is used to represent the target values:

In [12]: from sklearn.model_selection import train_test_split

In [13]: X_train, X_test, y_train, y_test = train_test_split(
    ...:     digits.data, digits.target, random_state=11)


We assume the data has balanced classes, that is, the samples are divided roughly evenly
among the classes. This is the case for the Digits dataset, which has about 180 samples
per digit. Unbalanced classes could lead to incorrect results.


In the “Functions” chapter, you saw how to seed a random-number generator for
reproducibility. In machine-learning studies, this helps others confirm your results by
working with the same randomly selected data. Function train_test_split
provides the keyword argument random_state for reproducibility. When you run the
code in the future with the same seed value, train_test_split will select the same
data for the training set and the same data for the testing set. We chose the seed value
(11) arbitrarily.


Training and Testing Set Sizes 


Looking at X_train’s and X_test’s shapes, you can see that, by default,
train_test_split reserves 75% of the data for training and 25% for testing:

In [14]: X_train.shape
Out[14]: (1347, 64)

In [15]: X_test.shape
Out[15]: (450, 64)

To specify different splits, you can set the sizes of the testing and training sets with the
train_test_split function’s keyword arguments test_size and train_size.
Use floating-point values from 0.0 through 1.0 to specify the percentages of the data to
use for each. You can use integer values to set the precise numbers of samples. If you
specify one of these keyword arguments, the other is inferred. For example, the
statement

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=11, test_size=0.20)

specifies that 20% of the data is for testing, so train_size is inferred to be 0.80.
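To make the shuffle-then-split idea and the role of the seed concrete, here is a minimal pure-Python sketch. It is not how scikit-learn implements train_test_split, and it will not select the same samples as scikit-learn does for a given seed. The function name shuffle_split, the stand-in data and the choice to round the test-set size up are assumptions of this example.

```python
import math
import random


def shuffle_split(samples, targets, test_size=0.25, seed=None):
    """Shuffle sample indices with a seeded generator, then split them."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)          # same seed -> same order
    n_test = math.ceil(len(samples) * test_size)  # assumption: round the test set up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([samples[i] for i in train_idx], [samples[i] for i in test_idx],
            [targets[i] for i in train_idx], [targets[i] for i in test_idx])


# Hypothetical stand-in data with as many rows as the Digits dataset.
samples = [[i] for i in range(1797)]
targets = [i % 10 for i in range(1797)]

X_train, X_test, y_train, y_test = shuffle_split(samples, targets, seed=11)
print(len(X_train), len(X_test))
```

Calling shuffle_split again with seed=11 returns exactly the same four lists, which is the reproducibility property that random_state provides.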


14.2.5 Creating the Model 


The KNeighborsClassifier estimator (module sklearn.neighbors) implements
the k-nearest neighbors algorithm. First, we create the KNeighborsClassifier
estimator object:

In [16]: from sklearn.neighbors import KNeighborsClassifier

In [17]: knn = KNeighborsClassifier()

To create an estimator, you simply create an object. The internal details of how this
object implements the k-nearest neighbors algorithm are hidden in the object. You’ll
simply call its methods. This is the essence of Python object-based programming.


14.2.6 Training the Model 


Next, we invoke the KNeighborsClassifier object’s fit method, which loads the
sample training set (X_train) and target training set (y_train) into the estimator:

In [18]: knn.fit(X=X_train, y=y_train)
Out[18]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')














For most scikit-learn estimators, the fit method loads the data into the estimator
then uses that data to perform complex calculations behind the scenes that learn from
the data and train the model. The KNeighborsClassifier’s fit method just loads
the data into the estimator, because k-NN actually has no initial learning process. The
estimator is said to be lazy because its work is performed only when you use it to make
predictions. In this and the next chapter, you'll use lots of models that have significant
training phases. In real-world machine-learning applications, it can sometimes take
minutes, hours, days or even months to train your models. We'll see in the next chapter,
“Deep Learning,” that special-purpose, high-performance hardware called GPUs and
TPUs can significantly reduce model training time.





As shown in snippet [18]’s output, the fit method returns the estimator, so IPython
displays its string representation, which includes the estimator’s default settings. The
n_neighbors value corresponds to k in the k-nearest neighbors algorithm. By default,
a KNeighborsClassifier looks at the five nearest neighbors to make its predictions.
For simplicity, we generally use the default estimator settings. For
KNeighborsClassifier, these are described at:

http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

Many of these settings are beyond the scope of this book. In Part 2 of this case study,
we'll discuss how to choose the best value for n_neighbors.


14.2.7 Predicting Digit Classes 





Now that we’ve loaded the data into the KNeighborsClassifier, we can use it with
the test samples to make predictions. Calling the estimator’s predict method with
X_test as an argument returns an array containing the predicted class of each test
image:

In [19]: predicted = knn.predict(X=X_test)

In [20]: expected = y_test

Let’s look at the predicted digits vs. expected digits for the first 20 test samples:

In [21]: predicted[:20]
Out[21]: array([0, 4, 9, 9, ...

In [22]: expected[:20]
Out[22]: array([0, 4, 9, ...

As you can see, in the first 20 elements, only the predicted and expected arrays’
values at index 18 do not match. We expected a 3, but the model predicted a 5.


Let’s use a list comprehension to locate all the incorrect predictions for the entire test
set, that is, the cases in which the predicted and expected values do not match:

In [23]: wrong = [(p, e) for (p, e) in zip(predicted, expected) if p != e]

In [24]: wrong

The list comprehension uses zip to create tuples containing the corresponding
elements in predicted and expected. We include a tuple in the result only if its p
(the predicted value) and e (the expected value) differ, that is, the predicted value was
incorrect. In this example, the estimator incorrectly predicted only 10 of the 450 test
samples. So the prediction accuracy of this estimator is an impressive 97.78%, even
though we used only the estimator’s default parameters.
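The following sketch applies the same list comprehension to two short, made-up label lists and then repeats the accuracy arithmetic for the figures quoted above (10 misses in 450 samples). The eight labels are hypothetical and are not the case study's actual predictions.

```python
# Hypothetical predicted and expected labels for eight test samples.
predicted = [0, 4, 9, 9, 3, 1, 5, 2]
expected = [0, 4, 9, 9, 3, 1, 3, 2]

# Keep only the (predicted, expected) pairs that disagree.
wrong = [(p, e) for (p, e) in zip(predicted, expected) if p != e]
accuracy = (len(expected) - len(wrong)) / len(expected)
print(wrong)
print(f'{accuracy:.2%}')

# The same arithmetic for the case study: 10 incorrect out of 450 test samples.
case_study_accuracy = (450 - 10) / 450
print(f'{case_study_accuracy:.2%}')
```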


14.3 CASE STUDY: CLASSIFICATION WITH K-NEAREST 
NEIGHBORS AND THE DIGITS DATASET, PART 2 


In this section, we continue the digit classification case study. We'll:

• evaluate the k-NN classification estimator’s accuracy,
• execute multiple estimators and compare their results so you can choose the
best one(s), and
• show how to tune k-NN’s hyperparameter k to get the best performance out of a
KNeighborsClassifier.


14.3.1 Metrics for Model Accuracy 


Once you've trained and tested a model, you'll want to measure its accuracy. Here, we'll
look at two ways of doing this: a classification estimator’s score method and a
confusion matrix.

Estimator Method score

Each estimator has a score method that returns an indication of how well the
estimator performs for the test data you pass as arguments. For classification
estimators, this method returns the prediction accuracy for the test data:

In [25]: print(f'{knn.score(X_test, y_test):.2%}')
97.78%

The KNeighborsClassifier with its default k (that is, n_neighbors=5) achieved
97.78% prediction accuracy. Shortly, we’ll perform hyperparameter tuning to try to
determine the optimal value for k, hoping that we get even better accuracy.


Confusion Matrix 


Another way to check a classification estimator’s accuracy is via a confusion matrix,
which shows the correct and incorrect predicted values (also known as the hits and
misses) for a given class. Simply call the function confusion_matrix from the
sklearn.metrics module, passing the expected classes and the predicted
classes as arguments, as in:

In [26]: from sklearn.metrics import confusion_matrix

In [27]: confusion = confusion_matrix(y_true=expected, y_pred=predicted)

The y_true keyword argument specifies the test samples’ actual classes. People looked
at the dataset’s images and labeled them with specific classes (the digit values). The
y_pred keyword argument specifies the predicted digits for those test images.


Below is the confusion matrix produced by the preceding call. The correct predictions
are shown on the diagonal from top-left to bottom-right. This is called the principal
diagonal. The nonzero values that are not on the principal diagonal indicate incorrect
predictions:

In [28]: confusion

Each row represents one distinct class, that is, one of the digits 0–9. The columns
within a row specify how many of the test samples were classified into each distinct
class. For example, row 0:

[45,  0,  0,  0,  0,  0,  0,  0,  0,  0]

represents the digit 0 class. The columns represent the ten possible target classes 0
through 9. Because we’re working with digits, the classes (0–9) and the row and
column index numbers (0–9) happen to match. According to row 0, 45 test samples
were classified as the digit 0, and none of the test samples were misclassified as any of
the digits 1 through 9. So 100% of the 0s were correctly predicted.
On the other hand, consider row 8, which represents the results for the digit 8:

[ 0,  1,  1,  2,  0,  0,  0,  0, 39,  1]

• The 1 at column index 1 indicates that one 8 was incorrectly classified as a 1.
• The 1 at column index 2 indicates that one 8 was incorrectly classified as a 2.
• The 2 at column index 3 indicates that two 8s were incorrectly classified as 3s.
• The 39 at column index 8 indicates that 39 8s were correctly classified as 8s.
• The 1 at column index 9 indicates that one 8 was incorrectly classified as a 9.

So the algorithm correctly predicted 88.64% (39 of 44) of the 8s. Earlier we saw that
the overall prediction accuracy of this estimator was 97.78%. The lower prediction
accuracy for 8s indicates that they're apparently harder to recognize than the other
digits.
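To show how a confusion matrix is laid out (rows for expected classes, columns for predicted classes), here is a small pure-Python sketch. The helper name confusion_counts and the three-class labels are hypothetical. The row_8 list is assembled from the column values listed in the bullets above; it is not output copied from scikit-learn.

```python
def confusion_counts(expected, predicted, n_classes):
    """Return a matrix where entry [e][p] counts samples of class e predicted as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for e, p in zip(expected, predicted):
        matrix[e][p] += 1
    return matrix


# Hypothetical three-class labels, only to show the layout.
expected = [0, 0, 1, 1, 1, 2, 2, 2, 2]
predicted = [0, 0, 1, 2, 1, 2, 2, 0, 2]
small_matrix = confusion_counts(expected, predicted, n_classes=3)
for row in small_matrix:
    print(row)

# Row 8 as described in the bullets above (columns 0 through 9).
row_8 = [0, 1, 1, 2, 0, 0, 0, 0, 39, 1]
share_correct = row_8[8] / sum(row_8)  # diagonal entry over the row total
print(f'{share_correct:.2%}')
```

Dividing a row's diagonal entry by the row total gives the fraction of that class's samples that were predicted correctly.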


Classification Report 


The sklearn.metrics module also provides function classification_report,
which produces a table of classification metrics⁵ based on the expected and
predicted values:

⁵ http://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-and-f-measures.

In [29]: from sklearn.metrics import classification_report

In [30]: names = [str(digit) for digit in digits.target_names]

In [31]: print(classification_report(expected, predicted,
    ...:     target_names=names))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        45
           1       0.98      1.00      0.99        45
           2       0.98      1.00      0.99        54
           3       0.95      0.95      0.95        44
           4       0.98      0.98      0.98        50
           5       0.97      1.00      0.99        38
           6       1.00      1.00      1.00        42
           7       0.96      1.00      0.98        45
           8       0.97      0.89      0.93        44
           9       0.98      0.95      0.96        43

   micro avg       0.98      0.98      0.98       450
   macro avg       0.98      0.98      0.98       450
weighted avg       0.98      0.98      0.98       450

In the report: 


• precision is the total number of correct predictions for a given digit divided by the
total number of predictions for that digit. You can confirm the precision by looking
at each column in the confusion matrix. For example, if you look at column index 7,
you'll see 1s in rows 3 and 4, indicating that one 3 and one 4 were incorrectly
classified as 7s and a 45 in row 7 indicating that 45 images were correctly classified
as 7s. So the precision for the digit 7 is 45/47 or 0.96.

• recall is the total number of correct predictions for a given digit divided by the total
number of samples that should have been predicted as that digit. You can confirm
the recall by looking at each row in the confusion matrix. For example, if you look at
row index 8, you'll see three 1s and a 2 indicating that some 8s were incorrectly
classified as other digits and a 39 indicating that 39 images were correctly classified.
So the recall for the digit 8 is 39/44 or 0.89.

• f1-score is the harmonic mean of the precision and the recall.

• support is the number of samples with a given expected value. For example, 50
samples were labeled as 4s, and 38 samples were labeled as 5s.
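These metrics can be checked by hand. The sketch below uses the conventional definitions (the F1 score is the harmonic mean of precision and recall) and the counts given above for the digit 7. The function name precision_recall_f1 is hypothetical, and false_neg=0 is inferred from the report's recall of 1.00 for the digit 7 rather than read from the matrix.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 from raw counts for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Digit 7: 45 correct 7s, plus one 3 and one 4 misclassified as 7s (false positives).
# Assumption: no 7 was missed (false_neg=0), consistent with a recall of 1.00.
precision, recall, f1 = precision_recall_f1(true_pos=45, false_pos=2, false_neg=0)
print(f'{precision:.2f} {recall:.2f} {f1:.2f}')
```

The printed values can be compared with the row for the digit 7 in the classification report.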


For details on the averages displayed at the bottom of the report, see:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html


Visualizing the Confusion Matrix 


A heat map displays values as colors, often with values of higher magnitude displayed
as more intense colors. Seaborn’s graphing functions work with two-dimensional data.
When using a pandas DataFrame as the data source, Seaborn automatically labels its
visualizations using the column names and row indices. Let’s convert the confusion
matrix into a DataFrame, then graph it:

In [32]: import pandas as pd

In [33]: confusion_df = pd.DataFrame(confusion, index=range(10),
    ...:     columns=range(10))

In [34]: import seaborn as sns

In [35]: axes = sns.heatmap(confusion_df, annot=True,
    ...:                    cmap='nipy_spectral_r')


The Seaborn function heatmap creates a heat map from the specified DataFrame. The
keyword argument annot=True (short for “annotation”) writes each cell’s value inside
the cell. By default, heatmap also displays a color bar to the right of the diagram,
showing how the values correspond to the heat map’s colors. The
cmap='nipy_spectral_r' keyword argument specifies which color map to use. We
used the nipy_spectral_r color map with the colors shown in the heat map’s color
bar. When you display a confusion matrix as a heat map, the principal diagonal and the
incorrect predictions stand out nicely.


[Figure: heat map of the confusion matrix, with a color bar to its right]





14.3.2 K-Fold Cross-Validation 


K-fold cross-validation enables you to use all of your data for both training and
testing, to get a better sense of how well your model will make predictions for new data
by repeatedly training and testing the model with different portions of the dataset.
K-fold cross-validation splits the dataset into k equal-size folds (this k is unrelated to k in
the k-nearest neighbors algorithm). You then repeatedly train your model with k − 1
folds and test the model with the remaining fold. For example, consider using k = 10
with folds numbered 1 through 10. With 10 folds, we’d do 10 successive training and
testing cycles:

• First, we’d train with folds 1–9, then test with fold 10.
• Next, we’d train with folds 1–8 and 10, then test with fold 9.
• Next, we’d train with folds 1–7 and 9–10, then test with fold 8.

This training and testing cycle continues until each fold has been used to test the
model.
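The fold bookkeeping described above can be sketched without any library. This is an illustration under stated assumptions, not scikit-learn's KFold: the function name k_fold_indices is hypothetical, no shuffling is done, and when the sample count is not divisible by the number of folds the leftover samples are spread over the first folds so that fold sizes differ by at most one.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs, one per fold, without shuffling."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=1797, n_splits=10))
test_sizes = [len(test) for _, test in folds]
print(test_sizes)
```

Each sample index appears in exactly one test fold, and in the training set of every other cycle. The order in which folds serve as the test set does not affect the result.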


KFold Class 


Scikit-learn provides the KFold class and the cross_val_score function (both in
the module sklearn.model_selection) to help you perform the training and
testing cycles described above. Let’s perform k-fold cross-validation with the Digits
dataset and the KNeighborsClassifier created earlier. First, create a KFold object:

In [36]: from sklearn.model_selection import KFold

In [37]: kfold = KFold(n_splits=10, random_state=11, shuffle=True)

The keyword arguments are:

• n_splits=10, which specifies the number of folds.

• random_state=11, which seeds the random number generator for
reproducibility.

• shuffle=True, which causes the KFold object to randomize the data by shuffling
it before splitting it into folds. This is particularly important if the samples might be
ordered or grouped. For example, the Iris dataset we'll use later in this chapter has
150 samples of three Iris species: the first 50 are Iris setosa, the next 50 are Iris
versicolor and the last 50 are Iris virginica. If we do not shuffle the samples, then
the training data might contain none of a particular Iris species and the test data
might be all of one species.


Using the KFold Object with Function cross_val_score 


Next, use function cross_val_score to train and test your model: 

In [38]: from sklearn.model_selection import cross_val_score 

In [39]: scores = cross_val_score(estimator=knn, X=digits.data, 
    ...:     y=digits.target, cv=kfold) 


The keyword arguments are: 

• estimator=knn, which specifies the estimator you'd like to validate. 
• X=digits.data, which specifies the samples to use for training and testing. 
• y=digits.target, which specifies the target predictions for the samples. 
• cv=kfold, which specifies the cross-validation generator that defines how to split 
the samples and targets for training and testing. 

Function cross_val_score returns an array of accuracy scores, one for each fold. As 
you can see below, the model was quite accurate. Its lowest accuracy score was 
0.97777778 (97.78%) and in one case it was 100% accurate in predicting an entire 
fold: 


In [40]: scores 
Out[40]: 
array([0.97777778, 0.99444444, 0.98888889, 0.97777778, 0.98888889, 
       0.99444444, 0.97777778, 0.98882682, 1.        , 0.98324022]) 


Once you have the accuracy scores, you can get an overall sense of the model’s accuracy 
by calculating the mean accuracy score and the standard deviation among the 10 
accuracy scores (or whatever number of folds you choose): 

In [41]: print(f'Mean accuracy: {scores.mean():.2%}') 
Mean accuracy: 98.72% 

In [42]: print(f'Accuracy standard deviation: {scores.std():.2%}') 
Accuracy standard deviation: 0.75% 

On average, the model was 98.72% accurate, even better than the 97.78% we achieved 
when we trained the model with 75% of the data and tested the model with 25% earlier. 
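
As a cross-check, the same two figures can be reproduced from the ten scores with Python's statistics module. This sketch assumes the NumPy array's std method computes the population standard deviation by default, which is why it uses pstdev rather than stdev. 

```python
from statistics import mean, pstdev

# The ten per-fold accuracy scores shown above.
scores = [0.97777778, 0.99444444, 0.98888889, 0.97777778, 0.98888889,
          0.99444444, 0.97777778, 0.98882682, 1.0, 0.98324022]

mean_accuracy = mean(scores)
accuracy_std = pstdev(scores)  # population standard deviation (divide by n)

print(f'Mean accuracy: {mean_accuracy:.2%}')               # 98.72%
print(f'Accuracy standard deviation: {accuracy_std:.2%}')  # 0.75%
```

Using statistics.stdev instead would divide by n – 1 and produce a slightly larger value. 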


14.3.3 Running Multiple Models to Find the Best One 


It’s difficult to know in advance which machine learning model(s) will perform best for 
a given dataset, especially when they hide the details of how they operate from their 
users. Even though the KNeighborsClassifier predicts digit images with a high 
degree of accuracy, it’s possible that other scikit-learn estimators are even more 
accurate. Scikit-learn provides many models with which you can quickly train and test 
your data. This encourages you to run multiple models to determine which is the best 
for a particular machine learning study. 


Let’s use the techniques from the preceding section to compare several classification 
estimators: KNeighborsClassifier, SVC and GaussianNB (there are more). 
Though we have not studied the SVC and GaussianNB estimators, scikit-learn 
nevertheless makes it easy for you to test-drive them by using their default settings.⁶ 

6 To avoid a warning in the current scikit-learn version at the time of this writing 
(version 0.20), we supplied one keyword argument when creating the SVC estimator. 
This argument’s value will become the default in scikit-learn version 0.22. 

First, let’s import the other two estimators: 

In [43]: from sklearn.svm import SVC 

In [44]: from sklearn.naive_bayes import GaussianNB 


Next, let’s create the estimators. The following dictionary contains key–value pairs for 
the existing KNeighborsClassifier we created earlier, plus new SVC and 
GaussianNB estimators: 

In [45]: estimators = { 
    ...:     'KNeighborsClassifier': knn, 
    ...:     'SVC': SVC(gamma='scale'), 
    ...:     'GaussianNB': GaussianNB()} 
    ...: 





Now, we can execute the models: 

In [46]: for estimator_name, estimator_object in estimators.items(): 
    ...:     kfold = KFold(n_splits=10, random_state=11, shuffle=True) 
    ...:     scores = cross_val_score(estimator=estimator_object, 
    ...:         X=digits.data, y=digits.target, cv=kfold) 
    ...:     print(f'{estimator_name:>20}: ' + 
    ...:           f'mean accuracy={scores.mean():.2%}; ' + 
    ...:           f'standard deviation={scores.std():.2%}') 
    ...: 
KNeighborsClassifier: mean accuracy=98.72%; standard deviation=0.75% 
                 SVC: mean accuracy=99.00%; standard deviation=0.85% 
          GaussianNB: mean accuracy=84.48%; standard deviation=3.47% 








This loop iterates through items in the estimators dictionary and for each key–value 
pair performs the following tasks: 

• Unpacks the key into estimator_name and value into estimator_object. 
• Creates a KFold object that shuffles the data and produces 10 folds. The keyword 
argument random_state is particularly important here because it ensures that 
each estimator works with identical folds, so we’re comparing “apples to apples.” 
• Evaluates the current estimator_object using cross_val_score. 
• Prints the estimator’s name, followed by the mean and standard deviation of the 
accuracy scores computed for each of the 10 folds. 

Based on the results, it appears that we can get slightly better accuracy from the SVC 
estimator, at least when using the estimator’s default settings. It’s possible that by 
tuning some of the estimators’ settings, we could get even better results. The 
KNeighborsClassifier and SVC estimators’ accuracies are nearly identical, so we 
might want to perform hyperparameter tuning on each to determine the best. 


Scikit-Learn Estimator Diagram 


The scikit-learn documentation provides a helpful diagram for choosing the right 
estimator, based on the kind and size of your data and the machine learning task you 
wish to perform: 

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html 


14.3.4 Hyperparameter Tuning 


Earlier in this section, we mentioned that k in the k-nearest neighbors algorithm is a 
hyperparameter of the algorithm. Hyperparameters are set before using the algorithm 
to train your model. In real-world machine learning studies, you'll want to use 
hyperparameter tuning to choose hyperparameter values that produce the best possible 
predictions. 

To determine the best value for k in the kNN algorithm, try different values of k then 
compare the estimator’s performance with each. We can do this using techniques 
similar to comparing estimators. The following loop creates KNeighborsClassifiers 
with odd k values from 1 through 19 (again, we use odd k values in kNN 
to avoid ties) and performs k-fold cross-validation on each. As you can see from the 
accuracy scores and standard deviations, the k value 1 in kNN produces the most 
accurate predictions for the Digits dataset. You can also see that accuracy tends to 
decrease for higher k values: 


In [47]: for k in range(1, 20, 2): 
    ...:     kfold = KFold(n_splits=10, random_state=11, shuffle=True) 
    ...:     knn = KNeighborsClassifier(n_neighbors=k) 
    ...:     scores = cross_val_score(estimator=knn, 
    ...:         X=digits.data, y=digits.target, cv=kfold) 
    ...:     print(f'k={k:<2}; mean accuracy={scores.mean():.2%}; ' + 
    ...:           f'standard deviation={scores.std():.2%}') 
    ...: 
k=1 ; mean accuracy=98.83%; standard deviation=0.58% 
k=3 ; mean accuracy=98.78%; standard deviation=0.78% 
k=5 ; mean accuracy=98.72%; standard deviation=0.75% 
k=7 ; mean accuracy=98.44%; standard deviation=0.96% 
k=9 ; mean accuracy=98.39%; standard deviation=0.80% 
k=11; mean accuracy=98.39%; standard deviation=0.80% 
k=13; mean accuracy=97.89%; standard deviation=0.89% 
k=15; mean accuracy=97.89%; standard deviation=1.02% 
k=17; mean accuracy=97.50%; standard deviation=1.00% 
k=19; mean accuracy=97.66%; standard deviation=0.96% 
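
Rather than scanning the output by eye, you could record each k value's results and let Python choose. The following sketch copies the figures from the output above into a dictionary; the tie-breaking rule (prefer the smaller standard deviation) is our own assumption, not something scikit-learn prescribes. 

```python
# k -> (mean accuracy %, standard deviation %), copied from the output above.
results = {
    1: (98.83, 0.58), 3: (98.78, 0.78), 5: (98.72, 0.75),
    7: (98.44, 0.96), 9: (98.39, 0.80), 11: (98.39, 0.80),
    13: (97.89, 0.89), 15: (97.89, 1.02), 17: (97.50, 1.00),
    19: (97.66, 0.96),
}

# Highest mean accuracy wins; on a tie, prefer the smaller spread.
best_k = max(results, key=lambda k: (results[k][0], -results[k][1]))
print(f'best k: {best_k}')
```

In a real study, you would collect these values inside the loop rather than typing them in. 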





Machine learning is not without its costs, especially as we head toward big data and 
deep learning. You must “know your data” and “know your tools.” For example, 
compute time grows rapidly with k, because kNN needs to perform more calculations 
to find the nearest neighbors. There is also function cross_validate, which does 
cross-validation and times the results. 


14.4 CASE STUDY: TIME SERIES AND SIMPLE LINEAR 
REGRESSION 


In the previous section, we demonstrated classification in which each sample was 
associated with a distinct class. Here, we continue our discussion of simple linear 
regression (the simplest of the regression algorithms) that began in Chapter 10’s Intro 
to Data Science section. Recall that given a collection of numeric values representing an 
independent variable and a dependent variable, simple linear regression describes the 
relationship between these variables with a straight line, known as the regression line. 

Previously, we performed simple linear regression on a time series of average New York 
City January high-temperature data for 1895 through 2018. In that example, we used 
Seaborn’s regplot function to create a scatter plot of the data with a corresponding 
regression line. We also used the scipy.stats module’s linregress function to 
calculate the regression line’s slope and intercept. We then used those values to predict 
future temperatures and estimate past temperatures. 


In this section, we'll 

• use a scikit-learn estimator to reimplement the simple linear regression we showed 
in Chapter 10, 
• use Seaborn’s scatterplot function to plot the data and Matplotlib’s plot 
function to display the regression line, then 
• use the coefficient and intercept values calculated by the scikit-learn estimator to 
make predictions. 

Later, we'll look at multiple linear regression (also simply called linear regression). 


For your convenience, we provide the temperature data in the ch14 examples folder in 
a CSV file named ave_hi_nyc_jan_1895-2018.csv. Once again, launch IPython 
with the --matplotlib option: 

ipython --matplotlib 


Loading the Average High Temperatures into a DataFrame 


As we did in Chapter 10, let’s load the data from ave_hi_nyc_jan_1895-2018.csv, 
rename the 'Value' column to 'Temperature', remove 01 from the end of each 
date value and display a few data samples: 


In [1]: import pandas as pd 

In [2]: nyc = pd.read_csv('ave_hi_nyc_jan_1895-2018.csv') 

In [3]: nyc.columns = ['Date', 'Temperature', 'Anomaly'] 

In [4]: nyc.Date = nyc.Date.floordiv(100) 

In [5]: nyc.head(3) 
Out[5]: 
   Date  Temperature  Anomaly 
0  1895         34.2     -3.2 
1  1896         34.7     -2.7 
2  1897         35.5     -1.9 


Splitting the Data for Training and Testing 


In this example, we'll use the LinearRegression estimator from 
sklearn.linear_model. By default, this estimator uses all the numeric features in a 
dataset, performing a multiple linear regression (which we'll discuss in the next 
section). Here, we perform simple linear regression using one feature as the 


independent variable. So, we'll need to select one feature (the Date) from the dataset. 


When you select one column from a two-dimensional DataFrame, the result is a one- 
dimensional Series. However, scikit-learn estimators require their training and 
testing data to be two-dimensional arrays (or two-dimensional array-like data, such 
as lists of lists or pandas DataFrames). To use one-dimensional data with an 
estimator, you must transform it from one dimension containing n elements, into two 
dimensions containing n rows and one column as you'll see below. 

As we did in the previous case study, let’s split the data into training and testing sets. 
Once again, we use the keyword argument random_state for reproducibility: 


In [6]: from sklearn.model_selection import train_test_split 

In [7]: X_train, X_test, y_train, y_test = train_test_split( 
   ...:     nyc.Date.values.reshape(-1, 1), nyc.Temperature.values, 
   ...:     random_state=11) 


The expression nyc.Date returns the Date column’s Series, and the Series’ 
values attribute returns the NumPy array containing that Series’ values. To 
transform this one-dimensional array into two dimensions, we call the array’s reshape 
method. Normally, reshape’s two arguments are the precise number of rows and columns. 
However, the first argument -1 tells reshape to infer the number of rows, based on 
the number of columns (1) and the number of elements (124) in the array. The 
transformed array will have only one column, so reshape infers the number of rows to 
be 124, because the only way to fit 124 elements into an array with one column is by 
distributing them over 124 rows. 
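
If the effect of the reshape call is hard to picture, the following plain-Python sketch performs the same kind of transformation on a list. The helper to_column is our own illustrative name, and a list of lists only stands in for a two-dimensional NumPy array here. 

```python
def to_column(values):
    """Turn a flat sequence of n values into n rows of one column each."""
    return [[value] for value in values]

years = list(range(1895, 2019))     # 1895 through 2018, 124 values
column = to_column(years)

print(len(column), len(column[0]))  # number of rows, number of columns
print(column[:3])                   # the first three one-element rows
```

Each year becomes its own one-element row, which is the "n rows and one column" layout that scikit-learn estimators expect for a single feature. 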


We can confirm the 75%–25% train-test split by checking the shapes of X_train and 
X_test: 

In [8]: X_train.shape 
Out[8]: (93, 1) 

In [9]: X_test.shape 
Out[9]: (31, 1) 
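
The row counts follow directly from the 124 samples and the 25% test share. The sketch below shows the arithmetic; rounding the test share up and giving the remainder to training is our understanding of train_test_split's default behavior, so treat that detail as an assumption. The helper name split_sizes is ours. 

```python
import math

def split_sizes(n_samples, test_fraction=0.25):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test

print(split_sizes(124))  # (93, 31)
```

The same arithmetic applies to any dataset size, so you can predict the shapes before you split. 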


Training the Model 


Scikit-learn does not have a separate class for simple linear regression because it’s just 
a special case of multiple linear regression, so let’s train a LinearRegression 
estimator: 

In [10]: from sklearn.linear_model import LinearRegression 

In [11]: linear_regression = LinearRegression() 

In [12]: linear_regression.fit(X=X_train, y=y_train) 
Out[12]: 
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, 
         normalize=False) 





After training the estimator, fit returns the estimator, and IPython displays its string 
representation. For descriptions of the default settings, see: 

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html 

To find the best fitting regression line for the data, the LinearRegression estimator 
computes the slope and intercept values that minimize the sum of the squares of 
the data points’ distances from the line (an ordinary least squares fit). In Chapter 10’s 
Intro to Data Science section, we gave some insight into how the slope and intercept 
values are discovered. 
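
For a single feature, the least-squares slope and intercept can be written in closed form. The following plain-Python sketch shows that calculation on a few hypothetical points; it illustrates the math only and is not how scikit-learn is implemented internally. The function name fit_line is our own. 

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of co-deviations of x and y / sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return slope, intercept

# Hypothetical points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```

With real, noisy data the points will not fall exactly on the line, but the same formulas give the line with the smallest sum of squared vertical distances. 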


Now, we can get the slope and intercept used in the y = mx + b calculation to make 
predictions. The slope is stored in the estimator’s coef_ attribute (m in the equation) 
and the intercept is stored in the estimator’s intercept_ attribute (b in the equation): 

In [13]: linear_regression.coef_ 
Out[13]: array([0.01939167]) 

In [14]: linear_regression.intercept_ 
Out[14]: -0.30779820252656265 

We'll use these later to plot the regression line and make predictions for specific dates. 


Testing the Model 


Let’s test the model using the data in X_test and check some of the predictions 
throughout the dataset by displaying the predicted and expected values for every 
fifth element (we discuss how to assess the regression model’s accuracy in 
Section 14.5.8): 

In [15]: predicted = linear_regression.predict(X_test) 

In [16]: expected = y_test 

In [17]: for p, e in zip(predicted[::5], expected[::5]): 
    ...:     print(f'predicted: {p:.2f}, expected: {e:.2f}') 
    ...: 
predicted: 37.86, expected: 31.70 
predicted: 38.69, expected: 34.80 
predicted: 37.00, expected: 39.40 
predicted: 37.25, expected: 45.70 
predicted: 38.05, expected: 32.30 
predicted: 37.64, expected: 33.80 
predicted: 36.94, expected: 39.70 







Predicting Future Temperatures and Estimating Past Temperatures 


Let’s use the coefficient and intercept values to predict the January 2019 average high 
temperature and to estimate what the average high temperature was in January of 
1890. The lambda in the following snippet implements the equation for a line 

y = mx + b 

using the coef_ as m and the intercept_ as b. 

In [18]: predict = (lambda x: linear_regression.coef_ * x + 
    ...:                      linear_regression.intercept_) 

In [19]: predict(2019) 
Out[19]: array([38.84399018]) 

In [20]: predict(1890) 
Out[20]: array([36.34246432]) 
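
Because two points determine a straight line, the two predictions above are enough to recover m and b and to evaluate the line for any other year without the trained estimator. The sketch below does that in plain Python; the year 2018 in the last line is simply an example input of our choosing. 

```python
# Two (year, predicted temperature) pairs copied from Out[19] and Out[20].
x1, y1 = 1890, 36.34246432
x2, y2 = 2019, 38.84399018

slope = (y2 - y1) / (x2 - x1)  # m in y = mx + b
intercept = y1 - slope * x1    # b in y = mx + b

def predict(year):
    """Evaluate the regression line y = mx + b for one year."""
    return slope * year + intercept

print(f'slope={slope:.8f}, intercept={intercept:.4f}')
print(f'{predict(2018):.2f}')
```

The slope says the average January high rises by roughly 0.019 degrees per year along the fitted line. 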


Visualizing the Dataset with the Regression Line 


Next, let’s create a scatter plot of the dataset using Seaborn’s scatterplot function 
and Matplotlib’s plot function. First, use scatterplot with the nyc DataFrame to 
display the data points: 

In [21]: import seaborn as sns 

In [22]: axes = sns.scatterplot(data=nyc, x='Date', y='Temperature', 
    ...:     hue='Temperature', palette='winter', legend=False) 


The keyword arguments are: 

• data, which specifies the DataFrame (nyc) containing the data to display. 
• x and y, which specify the names of nyc’s columns that are the source of the data 
along the x- and y-axes, respectively. In this case, x is the 'Date' and y is the 
'Temperature'. The corresponding values from each column form x-y coordinate 
pairs used to plot the dots. 
• hue, which specifies which column’s data should be used to determine the dot 
colors. In this case, we use the 'Temperature' column. Color is not particularly 
important in this example, but we wanted to add some visual interest to the graph. 
• palette, which specifies a Matplotlib color map from which to choose the dots’ 
colors. 
• legend=False, which specifies that scatterplot should not show a legend for 
the graph. The default is True, but we do not need a legend for this example. 

As we did in Chapter 10, let’s scale the y-axis range of values so you'll be able to see the 
linear relationship better once we display the regression line: 


In [23]: axes.set_ylim(10, 70) 
Out[23]: (10, 70) 


Next, let’s display the regression line. First, create an array containing the minimum 
and maximum date values in nyc.Date. These are the x-coordinates of the regression 
line’s start and end points: 

In [24]: import numpy as np 

In [25]: x = np.array([min(nyc.Date.values), max(nyc.Date.values)]) 


Passing the array x to the predict lambda from snippet [18] produces an array 
containing the corresponding predicted values, which we'll use as the y-coordinates: 

In [26]: y = predict(x) 


Finally, we can use Matplotlib’s plot function to plot a line based on the x and y 
arrays, which represent the x- and y-coordinates of the points, respectively: 

In [27]: import matplotlib.pyplot as plt 

In [28]: line = plt.plot(x, y) 

The resulting scatterplot and regression line are shown below. This graph is nearly 
identical to the one you saw in Chapter 10’s Intro to Data Science section. 
Overfitting/Underfitting 


When creating a model, a key goal is to ensure that it is capable of making accurate 
predictions for data it has not yet seen. Two common problems that prevent accurate 
predictions are overfitting and underfitting: 

• Underfitting occurs when a model is too simple to make predictions, based on its 
training data. For example, you may use a linear model, such as simple linear 
regression, when in fact, the problem really requires a non-linear model. For 
example, temperatures vary significantly throughout the four seasons. If you’re 
trying to create a general model that can predict temperatures year-round, a simple 
linear regression model will underfit the data. 
• Overfitting occurs when your model is too complex. The most extreme case would 
be a model that memorizes its training data. That may be acceptable if your new 
data looks exactly like your training data, but ordinarily that’s not the case. When 
you make predictions with an overfit model, new data that matches the training 
data will produce perfect predictions, but the model will not know what to do with 
data it has never seen. 
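
The two extremes can be caricatured in a few lines of plain Python. In this sketch the data (hypothetical x–y pairs following y = 10x) and both "models" are invented purely for illustration: one memorizes its training data, the other ignores its input entirely. 

```python
training = {1: 10.0, 2: 20.0, 3: 30.0}  # hypothetical x -> y pairs (y = 10x)

def memorizer(x):
    """Overfit extreme: look up the answer, know nothing beyond the table."""
    return training.get(x)              # None for anything unseen

mean_y = sum(training.values()) / len(training)

def constant_model(x):
    """Underfit extreme: ignore x and always answer the training mean."""
    return mean_y

for x in (3, 4):                        # 3 was seen in training, 4 was not
    print(x, memorizer(x), constant_model(x))
```

The memorizer is perfect on the training data but has no answer for x = 4, while the constant model always answers 20.0, which is wrong for both inputs. 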


For additional information on underfitting and overfitting, see 

• https://en.wikipedia.org/wiki/Overfitting 
• https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/ 


14.5 CASE STUDY: MULTIPLE LINEAR REGRESSION 
WITH THE CALIFORNIA HOUSING DATASET 


In Chapter 10’s Intro to Data Science section, we performed simple linear regression on 
a small weather data time series using pandas, Seaborn’s regplot function and the 
SciPy’s stats module’s linregress function. In the previous section, we 
reimplemented that same example using scikit-learn’s LinearRegression estimator, 
Seaborn’s scatterplot function and Matplotlib’s plot function. Now, we'll perform 
linear regression with a much larger real-world dataset. 


The California Housing dataset⁷ bundled with scikit-learn has 20,640 samples, 
each with eight numerical features. We’ll perform a multiple linear regression that uses 
all eight numerical features to make more sophisticated housing price predictions than 
if we were to use only a single feature or a subset of the features. Once again, scikit- 
learn will do most of the work for you: LinearRegression performs multiple linear 
regression by default. 

7 http://lib.stat.cmu.edu/datasets. Pace, R. Kelley and Ronald Barry, 
Sparse Spatial Autoregressions, Statistics and Probability Letters, 33 (1997) 291-297. 
Submitted to the StatLib Datasets Archive by Kelley Pace 
(pace@unixl.sncc.lsu.edu). [9/Nov/99]. 


We'll visualize some of the data using Matplotlib and Seaborn, so launch IPython with 
Matplotlib support: 

ipython --matplotlib 


14.5.1 Loading the Dataset 


According to the California Housing Prices dataset’s description in scikit-learn, “This 
dataset was derived from the 1990 U.S. census, using one row per census block group. A 
block group is the smallest geographical unit for which the U.S. Census Bureau 
publishes sample data (a block group typically has a population of 600 to 3,000 
people).” The dataset has 20,640 samples (one per block group) with eight features 
each: 

• median income, in tens of thousands, so 8.37 would represent $83,700 
• median house age; in the dataset, the maximum value for this feature is 52 
• average number of rooms 
• average number of bedrooms 
• block population 
• average house occupancy 
• house block latitude 
• house block longitude 

Each sample also has as its target a corresponding median house value in hundreds of 
thousands, so 3.55 would represent $355,000. In the dataset, the maximum value for 
the target is 5, which represents $500,000. 

It’s reasonable to expect that more bedrooms or more rooms or higher income would 
mean higher house value. By combining these features to make predictions, we’re more 
likely to get more accurate predictions. 


Loading the Data 


Let’s load the dataset and familiarize ourselves with it. The 
fetch_california_housing function from the sklearn.datasets module 
returns a Bunch object containing the data and other information about the dataset: 

In [1]: from sklearn.datasets import fetch_california_housing 

In [2]: california = fetch_california_housing() 





Displaying the Dataset’s Description 


Let’s look at the dataset’s description. The DESCR information includes: 

• Number of Instances: this dataset contains 20,640 samples. 
• Number of Attributes: there are 8 features (attributes) per sample. 
• Attribute Information: feature descriptions. 
• Missing Attribute Values: none are missing in this dataset. 

According to the description, the target variable in this dataset is the median house 
value. This is the value we'll be trying to predict via multiple linear regression. 


In [3]: print(california.DESCR) 
.. _california_housing_dataset: 

California Housing dataset 
-------------------------- 

**Data Set Characteristics:** 

    :Number of Instances: 20640 

    :Number of Attributes: 8 numeric, predictive attributes and 
                           the target 

    :Attribute Information: 
        - MedInc        median income in block 
        - HouseAge      median house age in block 
        - AveRooms      average number of rooms 
        - AveBedrms     average number of bedrooms 
        - Population    block population 
        - AveOccup      average house occupancy 
        - Latitude      house block latitude 
        - Longitude     house block longitude 

    :Missing Attribute Values: None 

This dataset was obtained from the StatLib repository. 
http://lib.stat.cmu.edu/datasets/ 

The target variable is the median house value for California districts. 

This dataset was derived from the 1990 U.S. census, using one row per ce 

It can be downloaded/loaded using the 
:func:`sklearn.datasets.fetch_california_housing` function. 

.. topic:: References 

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions, 
      Statistics and Probability Letters, 33 (1997) 291-297 








Again, the Bunch object’s data and target attributes are NumPy arrays containing 
the 20,640 samples and their target values respectively. We can confirm the number of 
samples (rows) and features (columns) by looking at the data array’s shape attribute, 
which shows that there are 20,640 rows and 8 columns: 

In [4]: california.data.shape 
Out[4]: (20640, 8) 


Similarly, you can see that the number of target values (that is, the median house 
values) matches the number of samples by looking at the target array’s shape: 

In [5]: california.target.shape 
Out[5]: (20640,) 


The Bunch’s feature_names attribute contains the names that correspond to each 
column in the data array: 

In [6]: california.feature_names 
Out[6]: 
['MedInc', 
 'HouseAge', 
 'AveRooms', 
 'AveBedrms', 
 'Population', 
 'AveOccup', 
 'Latitude', 
 'Longitude'] 


14.5.2 Exploring the Data with Pandas 


Let’s use a pandas DataFrame to explore the data further. We'll also use the 
DataFrame with Seaborn in the next section to visualize some of the data. First, let’s 
import pandas and set some options: 

In [7]: import pandas as pd 

In [8]: pd.set_option('precision', 4) 

In [9]: pd.set_option('max_columns', 9) 

In [10]: pd.set_option('display.width', None) 





In the preceding set_option calls: 

• 'precision' is the maximum number of digits to display to the right of each 
decimal point. 
• 'max_columns' is the maximum number of columns to display when you output 
the DataFrame’s string representation. By default, if pandas cannot fit all of the 
columns left-to-right, it cuts out columns in the middle and displays an ellipsis (…) 
instead. The 'max_columns' setting enables pandas to show all the columns using 
multiple rows of output. As you'll see momentarily, we'll have nine columns in the 
DataFrame: the eight dataset features in california.data and an additional 
column for the target median house values (california.target). 
• 'display.width' specifies the width in characters of your Command Prompt 
(Windows), Terminal (macOS/Linux) or shell (Linux). The value None tells pandas 
to auto-detect the display width when formatting string representations of Series 
and DataFrames. 











Next, let’s create a DataFrame from the Bunch’s data, target and feature_names 
arrays. The first snippet below creates the initial DataFrame using the data in 
california.data and with the column names specified by 
california.feature_names. The second statement adds a column for the median 
house values stored in california.target: 

In [11]: california_df = pd.DataFrame(california.data, 
    ...:     columns=california.feature_names) 

In [12]: california_df['MedHouseValue'] = pd.Series(california.target) 











We can peek at some of the data using the head function. Notice that pandas displays 
the DataFrame’s first six columns, then skips a line of output and displays the 
remaining columns. The \ to the right of the column head "AveOccup" indicates that 
there are more columns displayed below. You'll see the \ only if the window in which 
IPython is running is too narrow to display all the columns left-to-right: 

In [13]: california_df.head() 
Out[13]: 
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  \ 
0  8.3252      41.0    6.9841     1.0238       322.0    2.5556 
1  8.3014      21.0    6.2381     0.9719      2401.0    2.1098 
2  7.2574      52.0    8.2881     1.0734       496.0    2.8023 
3  5.6431      52.0    5.8174     1.0731       558.0    2.5479 
4  3.8462      52.0    6.2819     1.0811       565.0    2.1815 

   Latitude  Longitude  MedHouseValue 
0     37.88    -122.23          4.526 
1     37.86    -122.22          3.585 
2     37.85    -122.24          3.521 
3     37.85    -122.25          3.413 
4     37.85    -122.25          3.422 


Let’s get a sense of the data in each column by calculating the DataFrame’s summary 
statistics. Note that the median income (measured in tens of thousands) and house 
values (again, measured in hundreds of thousands) are from 1990 and are significantly 
higher today: 


In [14]: california_df.describe() 
Out[14]: 
           MedInc    HouseAge    AveRooms   AveBedrms  Population  \ 
count  20640.0000  20640.0000  20640.0000  20640.0000  20640.0000 
mean       3.8707     28.6395      5.4290      1.0967   1425.4767 
std        1.8998     12.5856      2.4742      0.4739   1132.4621 
min        0.4999      1.0000      0.8462      0.3333      3.0000 
25%        2.5634     18.0000      4.4407      1.0061    787.0000 
50%        3.5348     29.0000      5.2291      1.0488   1166.0000 
75%        4.7432     37.0000      6.0524      1.0995   1725.0000 
max       15.0001     52.0000    141.9091     34.0667  35682.0000 

         AveOccup    Latitude   Longitude  MedHouseValue 
count  20640.0000  20640.0000  20640.0000     20640.0000 
mean       3.0707     35.6319   -119.5697         2.0686 
std       10.3860      2.1360      2.0035         1.1540 
min        0.6923     32.5400   -124.3500         0.1500 
25%        2.4297     33.9300   -121.8000         1.1960 
50%        2.8181     34.2600   -118.4900         1.7970 
75%        3.2823     37.7100   -118.0100         2.6472 
max     1243.3333     41.9500   -114.3100         5.0000 





14.5.3 Visualizing the Features 


It’s helpful to visualize your data by plotting the target value against each feature, in 
this case, to see how the median home value relates to each feature. To make our 
visualizations clearer, let’s use DataFrame method sample to randomly select 10% of 
the 20,640 samples for graphing purposes: 

In [15]: sample_df = california_df.sample(frac=0.1, random_state=17) 

The keyword argument frac specifies the fraction of the data to select (0.1 for 10%), 
and the keyword argument random_state enables you to seed the random number 
generator. The integer seed value (17), which we chose arbitrarily, is crucial for 
reproducibility. Each time you use the same seed value, method sample selects the 
same random subset of the DataFrame’s rows. Then, when we graph the data, you 
should get the same results. 
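
The effect of the seed can be demonstrated with Python's random module alone. This sketch draws row positions rather than DataFrame rows, and because pandas uses a different random number generator, the particular positions it picks will not match the ones pandas selects for the same seed; only the principle carries over. 

```python
import random

n_rows = 20640
sample_size = round(n_rows * 0.1)  # 10% of the rows

first = random.Random(17).sample(range(n_rows), sample_size)
second = random.Random(17).sample(range(n_rows), sample_size)
other = random.Random(18).sample(range(n_rows), sample_size)

print(sample_size)
print(first == second)  # same seed, same row positions
print(first == other)   # a different seed almost certainly differs
```

Sampling without replacement means no row position is selected twice, just as the sample method does by default. 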


Next, we'll use Matplotlib and Seaborn to display scatter plots of each of the eight 
features. Both libraries can display scatter plots. Seaborn’s are more attractive and 
require less code, so we'll use Seaborn to create the following scatter plots. First, we 
import both libraries and use Seaborn function set to scale each diagram’s fonts to two 
times their default size: 

In [16]: import matplotlib.pyplot as plt 

In [17]: import seaborn as sns 

In [18]: sns.set(font_scale=2) 

In [19]: sns.set_style('whitegrid') 




The following snippet displays the scatter plots.⁸ Each shows one feature along the x- 
axis and the median home value (california.target) along the y-axis, so we can 
see how each feature and the median house values relate to one another. We display 
each scatter plot in a separate window. The windows are displayed in the order the 
features were listed in snippet [6] with the most recently displayed window in the 
foreground: 

8 When you execute this code in IPython, each window will be displayed in front of the 
previous one. As you close each, you’ll see the one behind it. 

In [20]: for feature in california.feature_names: 
    ...:     plt.figure(figsize=(16, 9)) 
    ...:     sns.scatterplot(data=sample_df, x=feature, 
    ...:                     y='MedHouseValue', hue='MedHouseValue', 
    ...:                     palette='cool', legend=False) 
    ...: 


For each feature name, the snippet first creates a 16-inch-by-9-inch Matplotlib Figure. 
We’re plotting many data points, so we chose to use a larger window. If this window is 
larger than your screen, Matplotlib fits the Figure to the screen. Seaborn uses the 
current Figure to display the scatter plot. If you do not create a Figure first, Seaborn 
will create one. We created the Figure first here so we could display a large window 
for a scatter plot containing over 2000 points. 

Next, the snippet creates a Seaborn scatterplot in which the x-axis shows the 
current feature, the y-axis shows the 'MedHouseValue' (median house values), 
and the 'MedHouseValue' determines the dot colors (hue). Some interesting things 
to notice in these graphs: 


• The graphs showing the latitude and longitude each have two areas of especially
significant density. If you search online for the latitude and longitude values where
those dense areas appear, you'll see that these represent the greater Los Angeles and
greater San Francisco areas where house prices tend to be higher.

• In each graph, there is a horizontal line of dots at the y-axis value 5, which
represents the median house value $500,000. The highest home value that could be
chosen on the 1990 census form was “$500,000 or more.”⁹ So any block group
with a median house value over $500,000 is listed in the dataset as 5. Being able to
spot characteristics like this is a compelling reason to do data exploration and
visualization.

• In the HouseAge graph, there is a vertical line of dots at the x-axis value 52. The
highest home age that could be chosen on the 1990 census form was 52, so any
block group with a median house age over 52 is listed in the dataset as 52.

9 https://www.census.gov/prod/1/90dec/cph4/appdxe.pdf.


14.5.4 Splitting the Data for Training and Testing


Once again, to prepare for training and testing the model, let’s break the data into
training and testing sets using the train_test_split function, then check their
sizes:


In [21]: from sklearn.model_selection import train_test_split

In [22]: X_train, X_test, y_train, y_test = train_test_split(
    ...:     california.data, california.target, random_state=11)
    ...:

In [23]: X_train.shape
Out[23]: (15480, 8)

In [24]: X_test.shape
Out[24]: (5160, 8)


We used train_test_split’s keyword argument random_state to seed the
random number generator for reproducibility.
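
As a minimal sketch of how random_state makes a split repeatable, the following uses a
small synthetic dataset (the arrays X and y are made up for illustration and are not the
California Housing data). With 100 samples, the default split should place 75 samples in
the training set and 25 in the testing set, the same 75%/25% proportions as the shapes
shown above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with 2 features each, plus one target per sample
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Two splits seeded with the same random_state
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, random_state=11)

print(X_train.shape, X_test.shape)       # sizes of the training and testing sets
print(np.array_equal(X_test, X_test2))   # True when both splits chose the same rows
```

Changing or omitting random_state would typically select different rows each time, which
is why the case study fixes it.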


14.5.5 Training the Model 





Next, we'll train the model. By default, a LinearRegression estimator uses all the 
features in the dataset’s data array to perform a multiple linear regression. An error 
occurs if any of the features are categorical rather than numeric. If a dataset contains 
categorical data, you either must preprocess the categorical features into numerical 
ones (which you'll do in the next chapter) or must exclude the categorical features from 
the training process. A benefit of working with scikit-learn’s bundled datasets is that 


they’re already in the correct format for machine learning using scikit-learn’s models. 


As you saw in the previous two snippets, X_train and X_test each contain 8 columns,
one per feature. Let’s create a LinearRegression estimator and invoke its fit
method to train the estimator using X_train and y_train:


In [25]: from sklearn.linear_model import LinearRegression

In [26]: linear_regression = LinearRegression()

In [27]: linear_regression.fit(X=X_train, y=y_train)
Out[27]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)


Multiple linear regression produces separate coefficients for each feature (stored in
coef_) in the dataset and one intercept (stored in intercept_):














In [28]: for i, name in enumerate(california.feature_names):
    ...:     print(f'{name:>10}: {linear_regression.coef_[i]}')
    ...:
    MedInc: 0.4377030215382206
  HouseAge: 0.009216834565797713
  AveRooms: -0.10732526637360985
 AveBedrms: 0.611713307391811
Population: -5.756822009298454e-06
  AveOccup: -0.0033845664657163703
  Latitude: -0.419481860964907
 Longitude: -0.4337713349874016

In [29]: linear_regression.intercept_
Out[29]: -36.88295065605547


For positive coefficients, the median house value increases as the feature value
increases. For negative coefficients, the median house value decreases as the feature
value increases. Note that the population coefficient has a negative exponent (e-06),
so the coefficient’s value is actually -0.000005756822009298454. This is close to
zero, so a block group’s population apparently has little effect on the median house
value.


You can use these values with the following equation to make predictions:

$y = m_1 x_1 + m_2 x_2 + \dots + m_n x_n + b$

where

• $m_1, m_2, \dots, m_n$ are the feature coefficients,

• $b$ is the intercept,

• $x_1, x_2, \dots, x_n$ are the feature values (that is, the values of the independent
variables), and

• $y$ is the predicted value (that is, the dependent variable).
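
The equation above can be sketched in a few lines of plain Python. The function name
predict and the two-feature coefficients, intercept and sample below are hypothetical
values chosen so the arithmetic is easy to check by hand; they are not the California
Housing model's values:

```python
def predict(coefficients, intercept, features):
    """Return m1*x1 + m2*x2 + ... + mn*xn + b for one sample."""
    return sum(m * x for m, x in zip(coefficients, features)) + intercept

# Hypothetical two-feature model, not the California Housing coefficients
coefficients = [2.0, -1.0]   # m1, m2
intercept = 0.5              # b
sample = [3.0, 4.0]          # x1, x2

print(predict(coefficients, intercept, sample))  # 2.0*3.0 + (-1.0)*4.0 + 0.5 = 2.5
```

With a trained estimator, the same calculation would use the values in coef_ and
intercept_ together with one row of feature values.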


14.5.6 Testing the Model 


Now, let’s test the model by calling the estimator’s predict method with the test 
samples as an argument. As we've done in each of the previous examples, we store the 


array of predictions in predicted and the array of expected values in expected: 


In [30]: predicted = linear_regression.predict(X_test)

In [31]: expected = y_test


Let’s look at the first five predictions and their corresponding expected values: 


In [32]: predicted[:5]
Out[32]: array([1.25396876, 2.34693107, 2.03794745, 1.8701254 , 2.53608339])

In [33]: expected[:5]
Out[33]: array([0.762, 1.732, 1.125, 1.37 , 1.856])








With classification, we saw that the predictions were distinct classes that matched 
existing classes in the dataset. With regression, it’s tough to get exact predictions, 


because you have continuous outputs. Every possible value of $x_1, x_2, \dots, x_n$ in the
calculation

$y = m_1 x_1 + m_2 x_2 + \dots + m_n x_n + b$

predicts a value.


14.5.7 Visualizing the Expected vs. Predicted Prices 


Let’s look at the expected vs. predicted median house values for the test data. First, let’s 


create a DataFrame containing columns for the expected and predicted values: 


In [34]: df = pd.DataFrame()

In [35]: df['Expected'] = pd.Series(expected)

In [36]: df['Predicted'] = pd.Series(predicted)


Now let’s plot the data as a scatter plot with the expected (target) prices along the x-axis 


and the predicted prices along the y-axis: 


In [37]: figure = plt.figure(figsize=(9, 9))

In [38]: axes = sns.scatterplot(data=df, x='Expected', y='Predicted',
    ...:     hue='Predicted', palette='cool', legend=False)
    ...:


Next, let’s set the x- and y-axes’ limits to use the same scale along both axes: 


In [39]: start = min(expected.min(), predicted.min())

In [40]: end = max(expected.max(), predicted.max())

In [41]: axes.set_xlim(start, end)
Out[41]: (-0.6830978604144491, 7.155719818496834)

In [42]: axes.set_ylim(start, end)
Out[42]: (-0.6830978604144491, 7.155719818496834)


Now, let’s plot a line that represents perfect predictions (note that this is not a 
regression line). The following snippet displays a line between the points representing 
the lower-left corner of the graph (start, start) and the upper-right corner of the 
graph (end, end). The third argument ('k--') indicates the line’s style. The letter k 


represents the color black, and the -- indicates that plot should draw a dashed line: 


In [43]: line = plt.plot([start, end], [start, end], 'k--')


If every predicted value were to match the expected value, then all the dots would be
plotted along the dashed line. In the following diagram, it appears that as the expected
median house value increases, more of the predicted values fall below the line. So the
model seems to predict lower median house values as the expected median house value
increases.


14.5.8 Regression Model Metrics


Scikit-learn provides many metrics functions for evaluating how well estimators predict 
results and for comparing estimators to choose the best one(s) for your particular 


study. These metrics vary by estimator type. For example, the sklearn.metrics
functions confusion_matrix and classification_report used in the Digits
dataset classification case study are two of many metrics functions specifically for
evaluating classification estimators.


Among the many metrics for regression estimators is the model’s coefficient of
determination, which is also called the R² score. To calculate an estimator’s R²
score, call the sklearn.metrics module’s r2_score function with the arrays
representing the expected and predicted results:


In [44]: from sklearn import metrics

In [45]: metrics.r2_score(expected, predicted)
Out[45]: 0.6008983115964333





R² scores typically range from 0.0 to 1.0 with 1.0 being the best. An R² score of 1.0
indicates that the estimator perfectly predicts the dependent variable’s value, given the
independent variable(s) value(s). An R² score of 0.0 indicates the model cannot make
predictions with any accuracy, based on the independent variables’ values. Scikit-learn’s
r2_score also can return a negative value when a model’s predictions are worse than
always predicting the mean of the expected values.
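
To make the score's meaning concrete, here is a minimal plain-Python sketch of the
standard definition of the coefficient of determination, one minus the ratio of the
residual sum of squares to the total sum of squares. The function name r_squared and the
sample values are hypothetical; this illustrates the definition and is not scikit-learn's
source code:

```python
def r_squared(expected, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(expected) / len(expected)
    ss_res = sum((e - p) ** 2 for e, p in zip(expected, predicted))
    ss_tot = sum((e - mean) ** 2 for e in expected)
    return 1 - ss_res / ss_tot

expected = [1.0, 2.0, 3.0, 4.0]   # hypothetical target values

print(r_squared(expected, [1.0, 2.0, 3.0, 4.0]))  # perfect predictions: 1.0
print(r_squared(expected, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean: 0.0
print(r_squared(expected, [4.0, 3.0, 2.0, 1.0]))  # worse than predicting the mean: -3.0
```

The three calls show the perfect case, the case in which the model does no better than
the mean, and a case in which the score drops below zero.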


Another common metric for regression models is the mean squared error, which

• calculates the difference between each expected and predicted value (this is called
the error),

• squares each difference and

• calculates the average of the squared values.
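
The three steps above can be sketched directly in plain Python. The function name
mean_squared_error_sketch and the sample values are hypothetical and were chosen so the
result is easy to verify by hand:

```python
def mean_squared_error_sketch(expected, predicted):
    errors = [e - p for e, p in zip(expected, predicted)]   # difference per sample
    squared = [error ** 2 for error in errors]               # square each difference
    return sum(squared) / len(squared)                       # average of the squares

expected = [3.0, 5.0, 2.0, 7.0]    # hypothetical expected values
predicted = [2.0, 5.0, 4.0, 8.0]   # hypothetical predictions

print(mean_squared_error_sketch(expected, predicted))  # (1 + 0 + 4 + 1) / 4 = 1.5
```

Because each error is squared, large misses count far more than small ones, and positive
and negative errors cannot cancel each other out.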


To calculate an estimator’s mean squared error, call function mean_squared_error 
(from module sklearn.metrics) with the arrays representing the expected and 


predicted results: 


In [46]: metrics.mean_squared_error(expected, predicted)
Out[46]: 0.5350149774449119


When comparing estimators with the mean squared error metric, the one with the value
closest to 0 best fits your data. In the next section, we'll run several regression
estimators using the California Housing dataset. For the list of scikit-learn’s metrics
functions by estimator category, see

https://scikit-learn.org/stable/modules/model_evaluation.html


14.5.9 Choosing the Best Model 


As we did in the classification case study, let’s try several estimators to determine
whether any produces better results than the LinearRegression estimator. In this
example, we'll use the linear_regression estimator we already created as well as
ElasticNet, Lasso and Ridge regression estimators (all from the
sklearn.linear_model module). For information about these estimators, see

https://scikit-learn.org/stable/modules/linear_model.html


In [47]: from sklearn.linear_model import ElasticNet, Lasso, Ridge

In [48]: estimators = {
    ...:     'LinearRegression': linear_regression,
    ...:     'ElasticNet': ElasticNet(),
    ...:     'Lasso': Lasso(),
    ...:     'Ridge': Ridge()
    ...: }


Once again, we'll run the estimators using k-fold cross-validation with a KFold object
and the cross_val_score function. Here, we pass to cross_val_score the
additional keyword argument scoring='r2', which indicates that the function
should report the R² scores for each fold. Again, 1.0 is the best, so it appears that
LinearRegression and Ridge are the best models for this dataset:





In [49]: from sklearn.model_selection import KFold, cross_val_score

In [50]: for estimator_name, estimator_object in estimators.items():
    ...:     kfold = KFold(n_splits=10, random_state=11, shuffle=True)
    ...:     scores = cross_val_score(estimator=estimator_object,
    ...:         X=california.data, y=california.target, cv=kfold,
    ...:         scoring='r2')
    ...:     print(f'{estimator_name:>16}: ' +
    ...:           f'mean of r2 scores={scores.mean():.3f}')
    ...:
LinearRegression: mean of r2 scores=0.599
      ElasticNet: mean of r2 scores=0.423
           Lasso: mean of r2 scores=0.285
           Ridge: mean of r2 scores=0.599
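
To see what a shuffled 10-fold split does with the samples, here is a plain-Python sketch
of the idea. The function k_fold_indices is hypothetical and only approximates the
behavior; scikit-learn's KFold chooses its folds differently internally. The point is
that every sample is used for testing exactly once and for training in all other folds:

```python
import random

def k_fold_indices(n_samples, n_splits, seed):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)                      # shuffle once, reproducibly
    folds = [indices[i::n_splits] for i in range(n_splits)]   # n_splits roughly equal folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_indices(n_samples=20, n_splits=10, seed=11):
    print(len(train), len(test))   # each fold trains on 18 samples and tests on 2
```

An estimator would be trained and scored once per fold, and the mean of those scores is
what the snippet above reports for each estimator.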











14.6 CASE STUDY: UNSUPERVISED MACHINE 
LEARNING, PART 1—DIMENSIONALITY REDUCTION 


In our data science presentations, we’ve focused on getting to know your data. 
Unsupervised machine learning and visualization can help you do this by finding 


patterns and relationships among unlabeled samples. 


For datasets like the univariate time series we used earlier in this chapter, visualizing 


the data is easy. In that case, we had two variables—date and temperature—so we 


plotted the data in two dimensions with one variable along each axis. Using Matplotlib, 
Seaborn and other visualization libraries, you also can plot datasets with three variables 
using 3D visualizations. But how do you visualize data with more than three 
dimensions? For example, in the Digits dataset, every sample has 64 features and a 
target value. In big data, samples can have hundreds, thousands or even millions of 


features. 


To visualize a dataset with many features (that is, many dimensions), we'll first reduce 
the data to two or three dimensions. This requires an unsupervised machine learning 
technique called dimensionality reduction. When you graph the resulting 
information, you might see patterns in the data that will help you choose the most 
appropriate machine learning algorithms to use. For example, if the visualization 
contains clusters of points, it might indicate that there are distinct classes of 
information within the dataset. So a classification algorithm might be appropriate. Of 
course, you’d first need to determine the class of the samples in each cluster. This might 


require studying the samples in a cluster to see what they have in common. 


Dimensionality reduction also serves other purposes. Training estimators on big data 
with significant numbers of dimensions can take hours, days, weeks or longer. It’s also 
difficult for humans to think about data with large numbers of dimensions. This is 
called the curse of dimensionality. If the data has closely correlated features, some 
could be eliminated via dimensionality reduction to improve the training performance. 


This, however, might reduce the accuracy of the model. 


Recall that the Digits dataset is already labeled with 10 classes representing the digits
0–9. Let’s ignore those labels and use dimensionality reduction to reduce the dataset’s
features to two dimensions, so we can visualize the resulting data.


Loading the Digits Dataset 
Launch IPython with: 


ipython --matplotlib 


then load the dataset: 


In [1]: from sklearn.datasets import load_digits

In [2]: digits = load_digits()





Creating a TSNE Estimator for Dimensionality Reduction 


Next, we'll use the TSNE estimator (from the sklearn.manifold module) to
perform dimensionality reduction. This estimator uses an algorithm called t-distributed
Stochastic Neighbor Embedding (t-SNE)¹⁰ to analyze a dataset’s features and reduce
them to the specified number of dimensions. We first tried the popular PCA (principal
components analysis) estimator but did not like the results we were getting, so we
switched to TSNE. We'll show PCA later in this case study.

10 The algorithm’s details are beyond this book’s scope. For more information, see
https://scikit-learn.org/stable/modules/manifold.html#t-sne.


Let’s create a TSNE object for reducing a dataset’s features to two dimensions, as
specified by the keyword argument n_components. As with the other estimators we’ve
presented, we used the random_state keyword argument to ensure the
reproducibility of the “render sequence” when we display the digit clusters:


In [3]: from sklearn.manifold import TSNE

In [4]: tsne = TSNE(n_components=2, random_state=11)


Transforming the Digits Dataset’s Features into Two Dimensions 


Dimensionality reduction in scikit-learn typically involves two steps: training the
estimator with the dataset, then using the estimator to transform the data into the
specified number of dimensions. Estimators such as PCA let you perform these steps
separately with the methods fit and transform. TSNE does not provide a separate
transform method, so we perform both steps in one statement using its
fit_transform method:¹¹

11 Every call to fit_transform trains the estimator. If you intend to reuse an
estimator to reduce the dimensions of samples multiple times, use fit to train the
estimator once, then use transform to perform the reductions. We’ll use this technique with
PCA later in this case study.


In [5]: reduced_data = tsne.fit_transform(digits.data)








TSNE’s fit_transform method takes some time to train the estimator then perform
the reduction. On our system, this took about 20 seconds. When the method completes
its task, it returns an array with the same number of rows as digits.data, but only
two columns. You can confirm this by checking reduced_data’s shape:


In [6]: reduced_data.shape
Out[6]: (1797, 2)
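
The fit-once, transform-many approach mentioned in the footnote can be sketched with the
PCA estimator from sklearn.decomposition. This is a minimal illustration under our own
assumptions (the variable names and the 100-sample subset are hypothetical), not the
code the case study uses later:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()

pca = PCA(n_components=2, random_state=11)
pca.fit(digits.data)                                 # train the estimator once
reduced = pca.transform(digits.data)                 # reduce all the samples
reduced_subset = pca.transform(digits.data[:100])    # reuse it without retraining

print(reduced.shape, reduced_subset.shape)
```

Each call to transform reuses the components learned by fit, so no retraining occurs
between the two reductions.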


Visualizing the Reduced Data 


Now that we’ve reduced the original dataset to only two dimensions, let’s use a scatter 
plot to display the data. In this case, rather than Seaborn’s scatterplot function, 
we'll use Matplotlib’s scatter function, because it returns a collection of the plotted 


items. We'll use that feature in a second scatter plot momentarily: 


In [7]: import matplotlib.pyplot as plt

In [8]: dots = plt.scatter(reduced_data[:, 0], reduced_data[:, 1],
   ...:     c='black')
   ...:


Function scatter’s first two arguments are reduced_data’s columns (0 and 1)
containing the data for the x- and y-axes. The keyword argument c='black' specifies
the color of the dots. We did not label the axes, because they do not correspond to
specific features of the original dataset. The new features produced by the TSNE
estimator could be quite different from the dataset’s original features.


The following diagram shows the resulting scatter plot. There are clearly clusters of
related data points, though there appear to be 11 main clusters, rather than 10. There
also are “loose” data points that do not appear to be part of specific clusters. Based on
our earlier study of the Digits dataset, this makes sense because some digits were
difficult to classify.

[Figure: scatter plot of the two-dimensional reduced data, all dots in black.]


Visualizing the Reduced Data with Different Colors for Each Digit 


Though the preceding diagram shows clusters, we do not know whether all the items in 
each cluster represent the same digit. If they do not, then the clusters are not helpful. 
Let’s use the known targets in the Digits dataset to color all the dots so we can see 


whether these clusters indeed represent specific digits: 


In [9]: dots = plt.scatter(reduced_data[:, 0], reduced_data[:, 1],
   ...:     c=digits.target, cmap=plt.cm.get_cmap('nipy_spectral_r', 10))
   ...:














In this case, scatter’s keyword argument c=digits.target specifies that the
target values determine the dot colors. We also added the keyword argument

cmap=plt.cm.get_cmap('nipy_spectral_r', 10)

which specifies a color map to use when coloring the dots. In this case, we know we’re
coloring 10 digits, so we use the get_cmap method of Matplotlib’s cm object (from module
matplotlib.pyplot) to load a color map ('nipy_spectral_r') and select 10
distinct colors from the color map.


The following statement adds a color bar key to the right of the diagram so you can see 


which digit each color represents: 


In [10]: colorbar = plt.colorbar(dots)


Voila! We see 10 clusters corresponding to the digits 0–9. Again, there are a few smaller
groups of dots standing alone. Based on this, we might decide that a supervised-learning
approach like k-nearest neighbors would work well with this data. As an
experiment, you might want to investigate Matplotlib’s Axes3D, which provides x-, y-
and z-axes for plotting in three-dimensional graphs.








14.7 CASE STUDY: UNSUPERVISED MACHINE 
LEARNING, PART 2—K-MEANS CLUSTERING 


In this section, we introduce perhaps the simplest unsupervised machine learning
algorithm, k-means clustering. This algorithm analyzes unlabeled samples and
attempts to place them in clusters that appear to be related. The k in “k-means”
represents the number of clusters you'd like to see imposed on your data.

The algorithm organizes samples into the number of clusters you specify in advance,
using distance calculations similar to those of the k-nearest neighbors algorithm.

Each cluster of samples is grouped around a centroid, the cluster’s center point.
Initially, the algorithm chooses k centroids at random from the dataset’s samples. Then
the remaining samples are placed in the cluster whose centroid is the closest. The
centroids are iteratively recalculated and the samples re-assigned to clusters until, for
all clusters, the distances from a given centroid to the samples in its cluster are
minimized. The algorithm’s results are:


• a one-dimensional array of labels indicating the cluster to which each sample
belongs and

• a two-dimensional array of centroids representing the center of each cluster.
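
The steps described above can be sketched in plain Python. This is a simplified
illustration, not scikit-learn's KMeans implementation: the function name k_means, the
sample points and the stopping rule (stop when the cluster assignments no longer change)
are our own assumptions:

```python
import math
import random

def k_means(points, k, seed=None, max_iterations=100):
    """Cluster 2-D points into k groups; return (labels, centroids)."""
    centroids = random.Random(seed).sample(points, k)   # k samples chosen at random
    labels = None
    for _ in range(max_iterations):
        # Assign every sample to the cluster whose centroid is closest
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:        # assignments stopped changing
            break
        labels = new_labels
        # Recalculate each centroid as the mean of the samples assigned to it
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:                 # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return labels, centroids

# Hypothetical samples forming two obvious groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2, seed=11)
print(labels)      # one cluster label per sample
print(centroids)   # one (x, y) center per cluster
```

The two return values correspond to the results listed above: a label for each sample
and a center for each cluster. Which group receives label 0 depends on the randomly
chosen starting centroids.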


Iris Dataset 


We'll work with the popular Iris dataset¹² bundled with scikit-learn, which is
commonly analyzed with both classification and clustering. Although this dataset is
labeled, we'll ignore those labels here to demonstrate clustering. Then, we'll use the
labels to determine how well the k-means algorithm clustered the samples.

12 Fisher, R.A., The use of multiple measurements in taxonomic problems, Annual
Eugenics, 7, Part II, 179-188 (1936); also in Contributions to Mathematical Statistics
(John Wiley, NY, 1950).


The Iris dataset is referred to as a “toy dataset” because it has only 150 samples and 
four features. The dataset describes 50 samples for each of three Iris flower species 
—Iris setosa, Iris versicolor and Iris virginica. Photos of these are shown below. Each 
sample’s features are the sepal length, sepal width, petal length and petal width, all 
measured in centimeters. The sepals are the larger outer parts of each flower that 


protect the smaller inside petals before the flower buds bloom. 





Iris setosa: https://commons.wikimedia.org/wiki/File:Wild_iris_KEFJ_(9025144383).jpg.
Credit: Courtesy of Nation Park services.

Iris versicolor: https://commons.wikimedia.org/wiki/Iris_versicolor#/media/File:IrisVersicolor-FoxRoost-Newfoundland.jpg.
Credit: Courtesy of Jefficus,
https://commons.wikimedia.org/w/index.php?title=User:Jefficus&action=edit&redlink=1

Iris virginica: https://commons.wikimedia.org/wiki/File:IMG_7911-Iris_virginica.jpg.
Credit: Christer T Johansson.


14.7.1 Loading the Iris Dataset 


Launch IPython with ipython --matplotlib, then use the sklearn.datasets 


module’s load_iris function to get a Bunch containing the dataset: 


In [1]: from sklearn.datasets import load_iris

In [2]: iris = load_iris()





The Bunch’s DESCR attribute indicates that there are 150 samples (Number of
Instances), each with four features (Number of Attributes). There are no
missing values in this dataset. The dataset classifies the samples by labeling them with
the integers 0, 1 and 2, representing Iris setosa, Iris versicolor and Iris virginica,
respectively. We’ll ignore the labels and let the k-means clustering algorithm try to
determine the samples’ classes. We show some key DESCR information in bold:


In [3]: print(iris.DESCR)
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica

    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...


Checking the Numbers of Samples, Features and Targets


You can confirm the number of samples and features per sample via the data array’s 


shape, and you can confirm the number of targets via the target array’s shape: 


In [4]: iris.data.shape
Out[4]: (150, 4)

In [5]: iris.target.shape
Out[5]: (150,)


The array target_names contains the names for the target array’s numeric labels.
In its output, dtype='<U10' indicates that the elements are strings with a maximum of 10
characters:


In [6]: iris.target_names
Out[6]: array(['setosa', 'versicolor', 'virginica'], dtype='<U10')


The feature_names attribute contains a list of string names, one for each column in
the data array:


In [7]: iris.feature_names
Out[7]:
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']


14.7.2 Exploring the Iris Dataset: Descriptive Statistics with Pandas 


Let’s use a DataFrame to explore the Iris dataset. As we did in the California Housing 


case study, let’s set the pandas options for formatting the column-based outputs: 

In [8]: import pandas as pd

In [9]: pd.set_option('max_columns', 5)

In [10]: pd.set_option('display.width', None)


Create a DataFrame containing the data array’s contents, using the contents of the
feature_names array as the column names:


In [11]: iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)


Next, add a column containing each sample’s species name. The list comprehension in 
the following snippet uses each value in the target array to look up the corresponding 


species name in the target_names array: 


In [12]: iris_df['species'] = [iris.target_names[i] for i in iris.target]
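
The lookup performed by that list comprehension can be tried on its own with ordinary
Python lists. The targets list below is a hypothetical handful of labels, not the
dataset's 150 values:

```python
target_names = ['setosa', 'versicolor', 'virginica']   # index i is the name for label i
targets = [0, 0, 1, 2, 1]                              # hypothetical numeric labels

# Use each numeric label as an index into target_names
species = [target_names[i] for i in targets]
print(species)  # ['setosa', 'setosa', 'versicolor', 'virginica', 'versicolor']
```

The result has one species name per label, in the same order as the labels, which is why
it can be assigned directly as a new DataFrame column.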














Let’s use pandas to look at a few samples. Once again notice that pandas displays a \ to
the right of the column heads to indicate that there are more columns displayed below:


In [13]: iris_df.head()
Out[13]:
   sepal length (cm)  sepal width (cm)  petal length (cm)  \
0                5.1               3.5                1.4
1                4.9               3.0                1.4
2                4.7               3.2                1.3
3                4.6               3.1                1.5
4                5.0               3.6                1.4

   petal width (cm) species
0               0.2  setosa
1               0.2  setosa
2               0.2  setosa
3               0.2  setosa
4               0.2  setosa

Let’s calculate some descriptive statistics for the numerical columns: 

In [14]: pd.set_option('precision', 2)

In [15]: iris_df.describe()
Out[15]:
       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count             150.00            150.00             150.00
mean                5.84              3.06               3.76
std                 0.83              0.44               1.77
min                 4.30              2.00               1.00
25%                 5.10              2.80               1.60
50%                 5.80              3.00               4.35
75%                 6.40              3.30               5.10
max                 7.90              4.40               6.90

       petal width (cm)
count            150.00
mean               1.20
std                0.76
min                0.10
25%                0.30
50%                1.30
75%                1.80
max                2.50


Calling the describe method on the 'species' column confirms that it contains 
three unique values. Here, we know in advance of working with this data that there are 
three classes to which the samples belong, though this is not always the case in 


unsupervised machine learning. 


In [16]: iris_df['species'].describe()
Out[16]:
count        150
unique         3
top       setosa
freq          50
Name: species, dtype: object


14.7.3 Visualizing the Dataset with a Seaborn pairplot 


Let’s visualize the features in this dataset. One way to learn more about your data is to 


see how the features relate to one another. The dataset has four features. We cannot 


graph one against the other three in a single graph. However, we can plot pairs of 


features against one another. Snippet [20] uses Seaborn function pairplot to create 


a grid of graphs plotting each feature against itself and the other specified features: 


In [17]: import seaborn as sns

In [18]: sns.set(font_scale=1.1)

In [19]: sns.set_style('whitegrid')

In [20]: grid = sns.pairplot(data=iris_df, vars=iris_df.columns[0:4],
    ...:     hue='species')
    ...:


The keyword arguments are:

• data: The DataFrame¹³ containing the data to plot.

• vars: A sequence containing the names of the variables to plot. For a DataFrame,
these are the names of the columns to plot. Here, we use the first four DataFrame
columns, representing the sepal length, sepal width, petal length and petal width,
respectively.

• hue: The DataFrame column that’s used to determine colors of the plotted data.
In this case, we'll color the data by Iris species.

13 This also may be a two-dimensional array or list.


The preceding call to pairplot produces the following 4-by-4 grid of graphs:

The graphs along the top-left-to-bottom-right diagonal show the distribution of just
the feature plotted in that column, with the range of values (left-to-right) and the
number of samples with those values (top-to-bottom). Consider the sepal-length
distributions:

[Figure: distributions of sepal length (cm) for the three Iris species.]


The tallest shaded area indicates that the range of sepal length values (shown along the
x-axis) for Iris setosa is approximately 4–6 centimeters and that most Iris setosa
samples are in the middle of that range (approximately 5 centimeters). Similarly, the
rightmost shaded area indicates that the range of sepal length values for Iris virginica
is approximately 4–8.5 centimeters and that the majority of Iris virginica samples have
sepal length values between 6 and 7 centimeters.


The other graphs in a column show scatter plots of the other features against the 


feature on the x-axis. In the first column, the other three graphs plot the sepal width, 


petal length and petal width, respectively, along the y-axis and the sepal length along 


the x-axis. 


When you run this code, you'll see in the full color output that using separate colors for 
each Iris species shows how the species relate to one another on a feature-by-feature 
basis. Interestingly, all the scatter plots clearly separate the Iris setosa blue dots from 
the other species’ orange and green dots, indicating that Iris setosa is indeed in a “class 
by itself.” We also can see that the other two species can sometimes be confused with 
one another, as indicated by the overlapping orange and green dots. For example, if you 
look at the scatter plot for sepal width vs. sepal length, you'll see the Iris versicolor and
Iris virginica dots are intermixed. This indicates that it would be difficult to distinguish
between these two species if we had only the sepal measurements available to us.


Displaying the pairplot in One Color 


If you remove the hue keyword argument, the pairplot function uses only one color to
plot all the data because it does not know how to distinguish the species:

In [21]: grid = sns.pairplot(data=iris_df, vars=iris_df.columns[0:4])


As you can see in the resulting pair plot on the next page, in this case, the graphs along 
the diagonal are histograms showing the distributions of all the values for that feature, 
regardless of the species. As you study each scatter plot, it appears that there may be 
only two distinct clusters, even though for this dataset we know there are three species. 
If you do not know the number of clusters in advance, you might ask a domain expert 
who is thoroughly familiar with the data. Such a person might know that there are three 
species in the dataset, which would be valuable information as we try to perform
machine learning on the data.


The pairplot diagrams work well for a small number of features or a subset of 
features so that you have a small number of rows and columns, and for a relatively 
small number of samples so you can see the data points. As the number of features and 
samples increases, each scatter plot quickly becomes too small to read. For larger 
datasets, you may choose to plot a subset of the features and potentially a randomly
selected subset of the samples to get a feel for your data.
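
One way to pick such a random subset is sketched below using only Python's standard library. The helper name sample_row_indices and the choice of 30 rows are illustrative assumptions, not part of the original example; with a pandas DataFrame you would more likely use its own sampling facilities.

```python
import random

def sample_row_indices(n_rows, n_samples, seed=11):
    """Return a sorted random subset of row indices, chosen without replacement."""
    rng = random.Random(seed)  # seeded so the same subset is chosen on every run
    return sorted(rng.sample(range(n_rows), n_samples))

# Hypothetical usage: choose 30 of the Iris dataset's 150 rows to plot.
subset = sample_row_indices(150, 30)
print(len(subset), subset[:5])
```

The resulting indices could then be used to select rows before plotting, which keeps each scatter plot readable.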
14.7.4 Using a KMeans Estimator


In this section, we'll use k-means clustering via scikit-learn’s KMeans estimator (from
the sklearn.cluster module) to place each sample in the Iris dataset into a cluster.
As with the other estimators you've used, the KMeans estimator hides from you the
algorithm’s complex mathematical details, making it straightforward to use.


Creating the Estimator 


Let’s create the KMeans object: 




In [22]: from sklearn.cluster import KMeans 


In [23]: kmeans = KMeans(n_clusters=3, random_state=11) 


The keyword argument n_clusters specifies the k-means clustering algorithm’s 
hyperparameter k, which KMeans requires to calculate the clusters and label each 
sample. When you train a KMeans estimator, the algorithm calculates for each cluster a
centroid representing the cluster’s center data point.
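
To make the idea of a centroid concrete, here is a minimal pure-Python sketch of a single k-means pass: each sample is assigned to its nearest centroid, then each centroid is recomputed as the mean of its assigned samples. This is only an illustration under simplifying assumptions (Euclidean distance, tiny made-up data, hypothetical function names); it is not how scikit-learn's KMeans is implemented.

```python
import math

def nearest_centroid(sample, centroids):
    """Return the index of the centroid closest to sample (Euclidean distance)."""
    distances = [math.dist(sample, c) for c in centroids]
    return distances.index(min(distances))

def compute_centroid(samples):
    """Return the feature-by-feature mean of a non-empty list of samples."""
    return [sum(feature) / len(samples) for feature in zip(*samples)]

def kmeans_step(samples, centroids):
    """Perform one assignment-and-update pass of k-means."""
    labels = [nearest_centroid(s, centroids) for s in samples]
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [s for s, label in zip(samples, labels) if label == k]
        # Keep the old centroid if no samples were assigned to it.
        new_centroids.append(compute_centroid(members) if members else list(old))
    return labels, new_centroids

# Made-up two-feature samples and two starting centroids (k = 2).
samples = [[1.0, 1.0], [1.0, 3.0], [9.0, 9.0], [11.0, 9.0]]
labels, centroids = kmeans_step(samples, [[0.0, 0.0], [10.0, 10.0]])
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [[1.0, 2.0], [10.0, 9.0]]
```

A full k-means run would repeat this pass until the labels stop changing or an iteration limit is reached.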


The default value for the n_clusters parameter is 8. Often, you'll rely on domain
experts knowledgeable about the data to help choose an appropriate k value. However,
with hyperparameter tuning, you can estimate the appropriate k, as we'll do later. In
this case, we know there are three species, so we'll use n_clusters=3 to see how well
KMeans does in labeling the Iris samples. Once again, we used the random_state
keyword argument for reproducibility.


Fitting the Model 


Next, we'll train the estimator by calling the KMeans object’s fit method. This step
performs the k-means algorithm discussed earlier:

In [24]: kmeans.fit(iris.data)
Out[24]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=11, tol=0.0001, verbose=0)











As with the other estimators, the fit method returns the estimator object and IPython
displays its string representation. You can see the KMeans default arguments at:

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.h

When the training completes, the KMeans object contains:

• A labels_ array with values from 0 to n_clusters - 1 (in this example, 0–2),
indicating the clusters to which the samples belong.

• A cluster_centers_ array in which each row represents a centroid.


Comparing the Computer Cluster Labels to the Iris Dataset’s Target Values

Because the Iris dataset is labeled, we can look at its target array values to get a sense
of how well the k-means algorithm clustered the samples for the three Iris species. With
unlabeled data, we’d need to depend on a domain expert to help evaluate whether the
predicted classes make sense.


In this dataset, the first 50 samples are Iris setosa, the next 50 are Iris versicolor, and
the last 50 are Iris virginica. The Iris dataset’s target array represents these with the
values 0–2. If the KMeans estimator chose the clusters perfectly, then each group of 50
elements in the estimator’s labels_ array should have a distinct label. As you study
the results below, note that the KMeans estimator uses the values 0 through k - 1 to
label clusters, but these are not related to the Iris dataset’s target array.
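
Because cluster numbers are arbitrary, one way to compare them with known target values is to find the most common target within each cluster. The following is a small sketch of that idea; the function name majority_mapping and the six-sample data are made up for illustration and are not taken from the Iris results.

```python
from collections import Counter

def majority_mapping(cluster_labels, targets):
    """Map each cluster label to the most common target value among its samples."""
    by_cluster = {}
    for cluster, target in zip(cluster_labels, targets):
        by_cluster.setdefault(cluster, Counter())[target] += 1
    return {cluster: counts.most_common(1)[0][0]
            for cluster, counts in by_cluster.items()}

# Hypothetical labels for six samples: the cluster numbers differ from the
# target numbers even though the grouping is mostly the same.
cluster_labels = [1, 1, 0, 0, 2, 0]
targets        = [0, 0, 1, 1, 2, 2]
mapping = majority_mapping(cluster_labels, targets)
print(mapping)  # {1: 0, 0: 1, 2: 2}
```

Here cluster 1 corresponds to target 0 and cluster 0 to target 1, which shows why a cluster label should not be read as a class number.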


Let’s use slicing to see how each group of 50 Iris samples was clustered. The following
snippet shows that the first 50 samples were all placed in cluster 1:

In [25]: print(kmeans.labels_[0:50])
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1]











The next 50 samples should be placed into a second cluster. The following snippet
shows that most were placed in cluster 0, but two samples were placed in cluster 2:

In [26]: print(kmeans.labels_[50:100])
[0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]











Similarly, the last 50 samples should be placed into a third cluster. The following
snippet shows that many of these samples were placed in cluster 2, but 14 of the
samples were placed in cluster 0, indicating that the algorithm thought they belonged
to a different cluster:

In [27]: print(kmeans.labels_[100:150])
[2 0 2 2 2 2 0 2 2 2 2 2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 2 2 2
 2 0 2 2 2 0 2 2 2 0 2 2 0]










The results of these three snippets confirm what we saw in the pairplot diagrams
earlier in this section: that Iris setosa is “in a class by itself” and that there is some
confusion between Iris versicolor and Iris virginica.


14.7.5 Dimensionality Reduction with Principal Component Analysis


Next, we'll use the PCA estimator (from the sklearn.decomposition module) to
perform dimensionality reduction. This estimator uses an algorithm called principal
component analysis⁴ to analyze a dataset’s features and reduce them to the specified
number of dimensions. For the Iris dataset, we first tried the TSNE estimator shown
earlier but did not like the results we were getting. So we switched to PCA for the
following demonstration.

4. The algorithm’s details are beyond this book’s scope. For more information, see
https://scikit-learn.org/stable/modules/decomposition.html#pca.


Creating the PCA Object 


Like the TSNE estimator, a PCA estimator uses the keyword argument n_components
to specify the number of dimensions:

In [28]: from sklearn.decomposition import PCA

In [29]: pca = PCA(n_components=2, random_state=11)


Transforming the Iris Dataset’s Features into Two Dimensions 


Let’s train the estimator and produce the reduced data by calling the PCA estimator’s
fit and transform methods:

In [30]: pca.fit(iris.data)
Out[30]:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=11,
    svd_solver='auto', tol=0.0, whiten=False)

In [31]: iris_pca = pca.transform(iris.data)

When the method completes its task, it returns an array with the same number of rows
as iris.data, but only two columns. Let’s confirm this by checking iris_pca’s
shape:

In [32]: iris_pca.shape
Out[32]: (150, 2)





Note that we separately called the PCA estimator’s fit and transform methods,
rather than fit_transform, which we used with the TSNE estimator. In this example,
we’re going to reuse the trained estimator (produced with fit) to perform a second
transform to reduce the cluster centroids from four dimensions to two. This will
enable us to plot the centroid locations on each cluster.
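
The benefit of fitting once and transforming more than once can be illustrated with a toy estimator. The MeanCenterer class below is hypothetical and far simpler than PCA: its fit method learns per-feature means, and transform subtracts them. The point it sketches is that anything transformed with the same fitted object ends up in the same coordinate system.

```python
class MeanCenterer:
    """Toy estimator: fit learns per-feature means; transform subtracts them."""

    def fit(self, rows):
        self.means_ = [sum(col) / len(rows) for col in zip(*rows)]
        return self  # returning self mirrors the estimator convention

    def transform(self, rows):
        return [[value - mean for value, mean in zip(row, self.means_)]
                for row in rows]

data = [[1.0, 10.0], [3.0, 30.0]]
centerer = MeanCenterer().fit(data)       # learn the means once: [2.0, 20.0]
print(centerer.transform(data))           # [[-1.0, -10.0], [1.0, 10.0]]
print(centerer.transform([[2.0, 20.0]]))  # [[0.0, 0.0]]
```

The second transform call plays the role of reducing the centroids: it reuses what fit learned from the samples rather than learning something new from the centroids alone.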


Visualizing the Reduced Data 


Now that we’ve reduced the original dataset to only two dimensions, let’s use a scatter
plot to display the data. In this case, we'll use Seaborn’s scatterplot function. First,
let’s transform the reduced data into a DataFrame and add a species column that we'll
use to determine the dot colors:

In [33]: iris_pca_df = pd.DataFrame(iris_pca,
    ...:                            columns=['Component1', 'Component2'])

In [34]: iris_pca_df['species'] = iris_df.species

Next, let’s scatterplot the data in Seaborn:

In [35]: axes = sns.scatterplot(data=iris_pca_df, x='Component1',
    ...:     y='Component2', hue='species', legend='brief',
    ...:     palette='cool')


Each centroid in the KMeans object’s cluster_centers_ array has the same number
of features as the original dataset (four in this case). To plot the centroids, we must
reduce their dimensions. You can think of a centroid as the “average” sample in its
cluster. So each centroid should be transformed using the same PCA estimator we used
to reduce the other samples in that cluster:

In [36]: iris_centers = pca.transform(kmeans.cluster_centers_)

Now, we'll plot the centroids of the three clusters as larger black dots. Rather than
transform the iris_centers array into a DataFrame first, let’s use Matplotlib’s
scatter function to plot the three centroids:

In [37]: import matplotlib.pyplot as plt

In [38]: dots = plt.scatter(iris_centers[:,0], iris_centers[:,1],
    ...:                    s=100, c='k')


The keyword argument s=100 specifies the size of the plotted points, and the keyword 
argument c='k' specifies that the points should be displayed in black. 
14.7.6 Choosing the Best Clustering Estimator


As we did in the classification and regression case studies, let’s run multiple clustering
algorithms and see how well they cluster the three species of Iris flowers. Here we'll
attempt to cluster the Iris dataset’s samples using the kmeans object we created
earlier⁵ and objects of scikit-learn’s DBSCAN, MeanShift, SpectralClustering
and AgglomerativeClustering estimators. Like KMeans, you specify the number
of clusters in advance for the SpectralClustering and
AgglomerativeClustering estimators:

5. We’re running KMeans here on the small Iris dataset. If you experience performance
problems with KMeans on larger datasets, consider using the MiniBatchKMeans
estimator. The scikit-learn documentation indicates that MiniBatchKMeans is faster
on large datasets and the results are almost as good.

In [39]: from sklearn.cluster import DBSCAN, MeanShift, \
    ...:     SpectralClustering, AgglomerativeClustering

In [40]: estimators = {
    ...:     'KMeans': kmeans,
    ...:     'DBSCAN': DBSCAN(),
    ...:     'MeanShift': MeanShift(),
    ...:     'SpectralClustering': SpectralClustering(n_clusters=3),
    ...:     'AgglomerativeClustering':
    ...:         AgglomerativeClustering(n_clusters=3)
    ...: }





Each iteration of the following loop calls one estimator’s fit method with iris.data
as an argument, then uses NumPy’s unique function to get the cluster labels and
counts for the three groups of 50 samples and displays the results. Recall that for the
DBSCAN and MeanShift estimators, we did not specify the number of clusters in
advance. Interestingly, DBSCAN correctly predicted three clusters (labeled -1, 0 and 1),
though it placed 84 of the 100 Iris virginica and Iris versicolor samples in the same
cluster. The MeanShift estimator, on the other hand, predicted only two clusters
(labeled as 0 and 1), and placed 99 of the 100 Iris virginica and Iris versicolor samples
in the same cluster:


In [41]: import numpy as np

In [42]: for name, estimator in estimators.items():
    ...:     estimator.fit(iris.data)
    ...:     print(f'\n{name}:')
    ...:     for i in range(0, 101, 50):
    ...:         labels, counts = np.unique(
    ...:             estimator.labels_[i:i+50], return_counts=True)
    ...:         print(f'{i}-{i+50}:')
    ...:         for label, count in zip(labels, counts):
    ...:             print(f'   label={label}, count={count}')
    ...:

KMeans:
0-50:
   label=1, count=50
50-100:
   label=0, count=48
   label=2, count=2
100-150:
   label=0, count=14
   label=2, count=36

DBSCAN:
0-50:
   label=-1, count=1
   label=0, count=49
50-100:
   label=-1, count=6
   label=1, count=44
100-150:
   label=-1, count=10
   label=1, count=40

MeanShift:
0-50:
   label=1, count=50
50-100:
   label=0, count=49
   label=1, count=1
100-150:
   label=0, count=50

SpectralClustering:
0-50:
   label=2, count=50
50-100:
   label=1, count=50
100-150:
   label=0, count=
   label=1, count=

AgglomerativeClustering:
0-50:
   label=1, count=50
50-100:
   label=0, count=49
   label=2, count=1
100-150:
   label=0, count=15
   label=2, count=35
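
The per-group counting that the loop performs with NumPy's unique function can also be sketched with the standard library's Counter, which may help clarify what is being tallied. The function name count_labels_by_group and the eight made-up labels are assumptions for illustration only.

```python
from collections import Counter

def count_labels_by_group(labels, group_size=50):
    """Return {(start, stop): {label: count}} for consecutive groups of labels."""
    result = {}
    for start in range(0, len(labels), group_size):
        group = labels[start:start + group_size]
        # Sort by label so the order matches np.unique's sorted output.
        result[(start, start + len(group))] = dict(sorted(Counter(group).items()))
    return result

# Hypothetical labels: two groups of four samples each.
labels = [1, 1, 1, 1, 0, 0, 2, 0]
for (start, stop), counts in count_labels_by_group(labels, group_size=4).items():
    print(f'{start}-{stop}: {counts}')
```

A group whose dictionary has a single key was placed entirely in one cluster; more keys mean the group was split across clusters.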


Though these algorithms label every sample, the labels simply indicate the clusters. 
What do you do with the cluster information once you have it? If your goal is to use the 
data in supervised machine learning, typically you’d study the samples in each cluster 
to try to determine how they’re related and label them accordingly. As we'll see in the 
next chapter, unsupervised learning is commonly used in deep-learning applications.
Some examples of unlabeled data processed with unsupervised learning include tweets
from Twitter, Facebook posts, videos, photos, news articles, customers’ product
reviews, viewers’ movie reviews and more.


14.8 WRAP-UP 


In this chapter we began our study of machine learning, using the popular scikit-learn
library. We saw that machine learning is divided into two types: supervised machine
learning, which works with labeled data, and unsupervised machine learning, which
works with unlabeled data. Throughout this chapter, we continued emphasizing
visualizations using Matplotlib and Seaborn, particularly for getting to know your data.


We discussed how scikit-learn conveniently packages machine-learning algorithms as 
estimators. Each is encapsulated so you can create your models quickly with a small 
amount of code, even if you don’t know the intricate details of how these algorithms
work.


We looked at supervised machine learning with classification, then regression. We used
one of the simplest classification algorithms, k-nearest neighbors, to analyze the Digits
dataset bundled with scikit-learn. You saw that classification algorithms predict the
classes to which samples belong. Binary classification uses two classes (such as “spam”
or “not spam”) and multi-classification uses more than two classes (such as the 10
classes in the Digits dataset).


We performed the steps of a typical machine-learning case study, including loading the
dataset, exploring the data with pandas and visualizations, splitting the data for
training and testing, creating the model, training the model and making predictions.
We discussed why you should partition your data into a training set and a testing set.
You saw ways to evaluate a classification estimator’s accuracy via a confusion matrix
and a classification report.


We mentioned that it’s difficult to know in advance which model(s) will perform best 
on your data, so you typically try many models and pick the one that performs best. We 
showed that it’s easy to run multiple estimators. We also used hyperparameter tuning 
with k-fold cross-validation to choose the best value of k for the k-NN algorithm. 


We revisited the time series and simple linear regression example from Chapter 10’s
Intro to Data Science section, this time implementing it using a scikit-learn
LinearRegression estimator. Next, we used a LinearRegression estimator to
perform multiple linear regression with the California Housing dataset that’s bundled
with scikit-learn. You saw that the LinearRegression estimator, by default, uses all
the numerical features in a dataset to make more sophisticated predictions than you
can with simple linear regression. Again, we ran multiple scikit-learn estimators to
compare how they performed and choose the best one.


Next, we introduced unsupervised machine learning and mentioned that it’s
typically accomplished with clustering algorithms. We introduced dimensionality
reduction (with scikit-learn’s TSNE estimator) and used it to compress the Digits
dataset’s 64 features down to two for visualization purposes. This enabled us to see the
clustering of the digits data.


We presented one of the simplest unsupervised machine learning algorithms, k-means
clustering, and demonstrated clustering on the Iris dataset that’s also bundled with
scikit-learn. We used dimensionality reduction (with scikit-learn’s PCA estimator) to
compress the Iris dataset’s four features to two for visualization purposes to show the
clustering of the three Iris species in the dataset and their centroids. Finally, we ran
multiple clustering estimators to compare their ability to label the Iris dataset’s samples
into three clusters.


In the next chapter, we'll continue our study of machine learning technologies with
deep learning. We'll tackle some fascinating and challenging problems.


15. Deep Learning 


Objectives 

In this chapter you'll: 

• Understand what a neural network is and how it enables deep learning.

• Create Keras neural networks.

• Understand Keras layers, activation functions, loss functions and optimizers.

• Use a Keras convolutional neural network (CNN) trained on the MNIST dataset to
recognize handwritten digits.

• Use a Keras recurrent neural network (RNN) trained on the IMDb dataset to perform
binary classification of positive and negative movie reviews.

• Use TensorBoard to visualize the progress of training deep-learning networks.

• Learn which pretrained neural networks come with Keras.

• Understand the value of using models pretrained on the massive ImageNet dataset for
computer vision apps.


15.1 INTRODUCTION 


One of AI’s most exciting areas is deep learning, a powerful subset of machine learning that 
has produced impressive results in computer vision and many other areas over the last few 
years. The availability of big data, significant processor power, faster Internet speeds and 
advancements in parallel computing hardware and software are making it possible for more
organizations and individuals to pursue resource-intensive deep-learning solutions.


Keras and TensorFlow 


In the previous chapter, Scikit-learn enabled you to define machine-learning models
conveniently with one statement. Deep learning models require more sophisticated setups,
typically connecting multiple objects, called layers. We'll build our deep learning models
with Keras, which offers a friendly interface to Google’s TensorFlow, the most widely
used deep-learning library.¹ François Chollet of the Google Mind team developed Keras to
make deep-learning capabilities more accessible. His book Deep Learning with Python is a
must read.² Google has thousands of TensorFlow and Keras projects underway internally
and that number is growing quickly.³,⁴

1. Keras also serves as a friendlier interface to Microsoft’s CNTK and the Université de
Montréal’s Theano (which ceased development in 2017). Other popular deep learning
frameworks include Caffe (http://caffe.berkeleyvision.org/), Apache MXNet
(https://mxnet.apache.org/) and PyTorch (https://pytorch.org/).

2. Chollet, François. Deep Learning with Python. Shelter Island, NY: Manning Publications,
2018.

3. http://theweek.com/speedreads/654463/google-more-than-1000-artificial-intelligence-projects-works.

4. https://www.zdnet.com/article/google-says-exponential-growth-of-ai-is-changing-nature-of-compute/.


Models 


Deep learning models are complex and require an extensive mathematical background to
understand their inner workings. As we’ve done throughout the book, we'll avoid heavy
mathematics here, preferring English explanations.

Keras is to deep learning as Scikit-learn is to machine learning. Each encapsulates the
sophisticated mathematics, so developers need only define, parameterize and manipulate
objects. With Keras, you build your models from pre-existing components and quickly
parameterize those components to your unique requirements. This is what we’ve been
referring to as object-based programming throughout the book.


Experiment with Your Models 


Machine learning and deep learning are empirical rather than theoretical fields. You'll
experiment with many models, tweaking them in various ways until you find the models that
perform best for your applications. Keras facilitates such experimentation.


Dataset Sizes 


Deep learning works well when you have lots of data, but it also can be effective for smaller
datasets when combined with techniques like transfer learning⁵,⁶ and data augmentation.⁷,⁸
Transfer learning uses existing knowledge from a previously trained model as the
foundation for a new model. Data augmentation adds data to a dataset by deriving new data
from existing data. For example, in an image dataset, you might rotate the images left and
right so the model can learn about objects in different orientations. In general, though, the
more data you have, the better you'll be able to train a deep learning model.
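
As a rough illustration of data augmentation by rotation, the sketch below treats an image as a small 2D list of pixel values and derives left- and right-rotated copies. It is deliberately simplified: it uses only 90-degree rotations and plain Python lists, and the function names are hypothetical rather than part of Keras.

```python
def rotate_right(image):
    """Rotate a 2D list of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def rotate_left(image):
    """Rotate a 2D list of pixels 90 degrees counterclockwise."""
    return [list(row) for row in zip(*image)][::-1]

def augment(images):
    """Return the original images plus a left- and right-rotated copy of each."""
    augmented = []
    for image in images:
        augmented.extend([image, rotate_left(image), rotate_right(image)])
    return augmented

image = [[1, 2],
         [3, 4]]
print(rotate_right(image))    # [[3, 1], [4, 2]]
print(rotate_left(image))     # [[2, 4], [1, 3]]
print(len(augment([image])))  # 3
```

Each original image yields three training samples here, so a dataset augmented this way would be three times its original size.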


5. https://towardsdatascience.com/transfer-learning-from-pre-trained-models-f2393f124751.

6. https://medium.com/nanonets/nanonets-how-to-use-deep-learning-when-you-have-limited-data-f68c0b512cab.

7. https://towardsdatascience.com/data-augmentation-and-images-
aca9bd0dbe8.

8. https://medium.com/nanonets/how-to-use-deep-learning-when-you-have-limited-data-part-2-data-augmentation-c26971dc8ced.


Processing Power


Deep learning can require significant processing power. Complex models trained on big-data
datasets can take hours, days or even more to train. The models we present in this chapter
can be trained in minutes to just less than an hour on computers with conventional CPUs.
You'll need only a reasonably current personal computer. We'll discuss the special
high-performance hardware called GPUs (Graphics Processing Units) and TPUs (Tensor
Processing Units) developed by NVIDIA and Google to meet the extraordinary processing
demands of edge-of-the-practice deep-learning applications.


Bundled Datasets 


Keras comes packaged with some popular datasets. You'll work with two of these datasets in
the chapter’s examples. You can find many Keras studies online for each of these datasets,
including ones that take different approaches.

In the “Machine Learning” chapter, you worked with Scikit-learn’s Digits dataset, which
contained 1797 handwritten-digit images that were selected from the much larger MNIST
dataset (60,000 training images and 10,000 test images).⁹ In this chapter you'll work with
the full MNIST dataset. You'll build a Keras convolutional neural network (CNN or convnet)
model that will achieve high performance recognizing digit images in the test set. Convnets
are especially appropriate for computer vision tasks, such as recognizing handwritten digits
and characters or recognizing objects (including faces) in images and videos. You'll also work
with a Keras recurrent neural network. In that example, you'll perform sentiment analysis
using the IMDb Movie reviews dataset, in which the reviews in the training and testing sets
are labeled as positive or negative.

9. The MNIST Database. MNIST Handwritten Digit Database, Yann LeCun, Corinna Cortes
and Chris Burges. http://yann.lecun.com/exdb/mnist/.


Future of Deep Learning 


Newer automated deep learning capabilities are making it even easier to build deep-learning
solutions. These include Auto-Keras¹⁰ from Texas A&M University’s DATA Lab, Baidu’s
EZDL¹¹ and Google’s AutoML.¹²

10. https://autokeras.com/.

11. https://ai.baidu.com/ezdl/.

12. https://cloud.google.com/automl/.


15.1.1 Deep Learning Applications 


Deep learning is being used in a wide range of applications, such as:

• Game playing
• Computer vision: Object recognition, pattern recognition, facial recognition
• Self-driving cars
• Robotics
• Improving customer experiences
• Chatbots
• Diagnosing medical conditions
• Google Search
• Facial recognition
• Automated image captioning and video closed captioning
• Enhancing image resolution
• Speech recognition
• Language translation
• Predicting election results
• Predicting earthquakes and weather
• Google Sunroof to determine whether you can put solar panels on your roof
• Generative applications: Generating original images, processing existing images to look
like a specified artist’s style, adding color to black-and-white images and video, creating
music, creating text (books, poetry) and much more.


15.1.2 Deep Learning Demos 


Check out these four deep-learning demos and search online for lots more, including
practical applications like we mentioned in the preceding section:

• DeepArt.io: Turn a photo into artwork by applying an art style to the photo.
https://deepart.io/.

• DeepWarp Demo: Analyzes a person’s photo and makes the person’s eyes move in
different directions.
https://sites.skoltech.ru/sites/compvision_wiki/static_pages/projects/dee

• Image-to-Image Demo: Translates a line drawing into a picture.
https://affinelayer.com/pixsrv/.

• Google Translate Mobile App (download from an app store to your smartphone):
Translate text in a photo to another language (e.g., take a photo of a sign or a restaurant
menu in Spanish and translate the text to English).


15.1.3 Keras Resources 


Here are some resources you might find valuable as you study deep learning:

• To get your questions answered, go to the Keras team’s Slack channel at
https://kerasteam.slack.com.

• For articles and tutorials, visit https://blog.keras.io.

• The Keras documentation is at http://keras.io.

• If you’re looking for term projects, directed study projects, capstone course projects or
thesis topics, visit arXiv (pronounced “archive,” where the X represents the Greek letter
“chi”) at https://arXiv.org. People post their research papers here in parallel with
going through peer review for formal publication, hoping for fast feedback. So, this site
gives you access to extremely current research.


15.2 KERAS BUILT-IN DATASETS 


Here are some of Keras’s datasets (from the module tensorflow.keras.datasets¹³) for
practicing deep learning. We'll use a couple of these in the chapter’s examples:

13. In the standalone Keras library, the module names begin with keras rather than
tensorflow.keras.

• MNIST¹⁴ database of handwritten digits: Used for classifying handwritten digit
images, this dataset contains 28-by-28 grayscale digit images labeled as 0 through 9 with
60,000 images for training and 10,000 for testing. We use this dataset in Section 15.6,
where we study convolutional neural networks.

14. The MNIST Database. MNIST Handwritten Digit Database, Yann LeCun, Corinna
Cortes and Chris Burges. http://yann.lecun.com/exdb/mnist/.

• Fashion-MNIST¹⁵ database of fashion articles: Used for classifying clothing
images, this dataset contains 28-by-28 grayscale images of clothing labeled in 10
categories¹⁶ with 60,000 for training and 10,000 for testing. Once you build a model for
use with MNIST, you can reuse that model with Fashion-MNIST by changing a few
statements.

• IMDb Movie reviews¹⁷: Used for sentiment analysis, this dataset contains reviews
labeled as positive (1) or negative (0) sentiment with 25,000 reviews for training and
25,000 for testing. We use this dataset in Section 15.9, where we study recurrent neural
networks.

15. Han Xiao and Kashif Rasul and Roland Vollgraf, Fashion-MNIST: a Novel Image
Dataset for Benchmarking Machine Learning Algorithms, arXiv, cs.LG/1708.07747.

16. https://keras.io/datasets/#fashion-mnist-database-of-fashion-articles.

17. Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and
Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th
Annual Meeting of the Association for Computational Linguistics (ACL 2011).

• CIFAR10¹⁸ small image classification: Used for small-image classification, this
dataset contains 32-by-32 color images labeled in 10 categories with 50,000 images for
training and 10,000 for testing.

18. https://www.cs.toronto.edu/~kriz/cifar.html.

• CIFAR100¹⁹ small image classification: Also used for small-image classification,
this dataset contains 32-by-32 color images labeled in 100 categories with 50,000 images
for training and 10,000 for testing.

19. https://www.cs.toronto.edu/~kriz/cifar.html.


15.3 CUSTOM ANACONDA ENVIRONMENTS 


Before running this chapter’s examples, you’ll need to install the libraries we use. In this
chapter’s examples, we’ll use the TensorFlow deep-learning library’s version of Keras.²⁰ At
the time of this writing, TensorFlow does not yet support Python 3.7. So, you’ll need Python
3.6.x to execute this chapter’s examples. We'll show you how to set up a custom environment
for working with Keras and TensorFlow.

20. There’s also a standalone version that enables you to choose between TensorFlow,
Microsoft’s CNTK or the Université de Montréal’s Theano (which ceased development in
2017).


Environments in Anaconda 


The Anaconda Python distribution makes it easy to create custom environments. These are
separate configurations in which you can install different libraries and different library
versions. This can help with reproducibility if your code depends on specific Python or
library versions.²¹

21. In the next chapter, we’ll introduce Docker as another reproducibility mechanism and as a
convenient way to install complex environments for use on your local computer.

The default environment in Anaconda is called the base environment. This is created for you
when you install Anaconda. All the Python libraries that come with Anaconda are installed
into the base environment and, unless you specify otherwise, any additional libraries you
install also are placed there. Custom environments give you control over the specific libraries
you wish to install for your specific tasks.


Creating an Anaconda Environment 


The conda create command creates an environment. Let’s create a TensorFlow
environment and name it tf_env (you can name it whatever you like). Run the following
command in your Terminal, shell or Anaconda Command Prompt:²²,²³

22. Windows users should run the Anaconda Command Prompt as Administrator.

23. If you have a computer with an NVIDIA GPU that’s compatible with TensorFlow, you can
replace the tensorflow library with tensorflow-gpu to get better performance. For more
information, see https://www.tensorflow.org/install/gpu. Some AMD GPUs also
can be used with TensorFlow:
http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/.


lick here to view code image 


conda create -n tf_env tensorflow anaconda ipython jupyterlab scikit-learn matp 








4 | | > 











This will determine the listed libraries’ dependencies, then display all the libraries that will be
installed in the new environment. There are many dependencies, so this may take a few 


minutes. When you see the prompt: 


Proceed ([y]/n)? 


press Enter to create the environment and install the libraries.


When we created our custom environment, conda installed Python 3.6.7, which was the
most recent Python version compatible with the tensorflow library.


Activating an Alternate Anaconda Environment 


To use a custom environment, execute the conda activate command: 


conda activate tf_env 


This affects only the current Terminal, shell or Anaconda Command Prompt. When a custom 
environment is activated and you install more libraries, they become part of the activated 
environment, not the base environment. If you open separate Terminals, shells or Anaconda 


Command Prompts, they’ll use Anaconda’s base environment by default. 


Deactivating an Alternate Anaconda Environment 


When you're done with a custom environment, you can return to the base environment in the


current Terminal, shell or Anaconda Command Prompt by executing: 


conda deactivate 


Jupyter Notebooks and JupyterLab 


This chapter’s examples are provided only as Jupyter Notebooks, which will make it easier for 


you to experiment with the examples. You can tweak the options we present and reexecute
the notebooks. For this chapter, you should launch JupyterLab from the ch15 examples
folder (as discussed in Section 1.5.3).


15.4 NEURAL NETWORKS 


Deep learning is a form of machine learning that uses artificial neural networks to learn. An 
artificial neural network (or just neural network) is a software construct that operates 
similarly to how scientists believe our brains work. Our biological nervous systems are 
controlled via neurons that communicate with one another along pathways called
synapses. As we learn, the specific neurons that enable us to perform a given task, like
walking, communicate with one another more efficiently. These neurons activate anytime we
need to walk.


https://en.wikipedia.org/wiki/Neuron.


https://en.wikipedia.org/wiki/Synapse.


https://www.sciencenewsforstudents.org/article/learning-rewires-brain.


Artificial Neurons 


In a neural network, interconnected artificial neurons simulate the human brain’s 
neurons to help the network learn. The connections between specific neurons are reinforced 
during the learning process with the goal of achieving a specific result. In supervised deep 
learning—which we'll use in this chapter—we aim to predict the target labels supplied with 
data samples. To do this, we'll train a general neural network model that we can then use to 


make predictions on unseen data.


As in machine learning, you can create unsupervised deep learning networks; these are
beyond this chapter's scope.


Artificial Neural Network Diagram 


The following diagram shows a three-layer neural network. Each circle represents a neuron, 
and the lines between them simulate the synapses. The output of a neuron becomes the input 
of another neuron, hence the term neural network. This particular diagram shows a fully 
connected network—every neuron in a given layer is connected to all the neurons in the 


next layer: 


[Figure: a fully connected three-layer neural network with an input layer, a hidden layer and an output layer.]


Learning Is an Iterative Process


When you were a baby, you did not learn to walk instantaneously. You learned that process 
over time with repetition. You built up the smaller components of the movements that 
enabled you to walk—learning to stand, learning to balance to remain standing, learning to 
lift your foot and move it forward, etc. And you got feedback from your environment. When 
you walked successfully your parents smiled and clapped. When you fell, you might have 
bumped your head and felt pain. 


Similarly, we train neural networks iteratively over time. Each iteration is known as an 
epoch and processes every sample in the training dataset once. There’s no “correct” number 
of epochs. This is a hyperparameter that may need tuning, based on your training data and 
your model. The inputs to the network are the features in the training samples. Some layers 
learn new features from previous layers’ outputs and others interpret those features to make 


predictions. 


How Artificial Neurons Decide Whether to Activate Synapses 


During the training phase, the network calculates values called weights for every connection 
between the neurons in one layer and those in the next. On a neuron-by-neuron basis, each of 
its inputs is multiplied by that connection’s weight, then the sum of those weighted inputs is 
passed to the neuron’s activation function. This function’s output determines which 
neurons to activate based on the inputs—just like the neurons in your brain passing 
information around in response to inputs coming from your eyes, nose, ears and more. The 
following diagram shows a neuron receiving three inputs (the black dots) and producing an 
output (the hollow circle) that would be passed to all or some of the neurons in the next layer,
depending on the types of the neural network’s layers: 


[Figure: three inputs feeding one neuron, which produces one output.]





The values w1, w2 and w3 are weights. In a new model that you train from scratch, these
values are initialized randomly by the model. As the network trains, it tries to minimize the 
error rate between the network’s predicted labels and the samples’ actual labels. The error 
rate is known as the loss, and the calculation that determines the loss is called the loss 
function. Throughout training, the network determines the amount that each neuron 
contributes to the overall loss, then goes back through the layers and adjusts the weights in 
an effort to minimize that loss. This technique is called backpropagation. Optimizing these 


weights occurs gradually—typically via a process called gradient descent. 
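

The weighted-sum and weight-adjustment steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions, not how Keras implements training: the input values, the weights, the sigmoid activation, the squared-error loss and the learning rate of 0.1 are all made up for the example.

```python
import numpy as np

def sigmoid(x):
    """Squash any value into the range 0.0 to 1.0 (one common activation)."""
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(inputs, weights, activation=sigmoid):
    """Multiply each input by its connection's weight, sum, then activate."""
    weighted_sum = np.dot(inputs, weights)
    return activation(weighted_sum)

inputs = np.array([0.5, -1.0, 2.0])   # three hypothetical input values
weights = np.array([0.4, 0.3, 0.1])   # w1, w2 and w3 (made-up starting values)
output = neuron_output(inputs, weights)

# One gradient-descent step on a simplified linear neuron (no activation)
# with a squared-error loss, to show how weights move to reduce the loss.
target = 1.0          # hypothetical desired output
learning_rate = 0.1   # hypothetical step size

def loss(w):
    """Squared difference between the neuron's weighted sum and the target."""
    return (np.dot(inputs, w) - target) ** 2

gradient = 2 * (np.dot(inputs, weights) - target) * inputs
new_weights = weights - learning_rate * gradient

print(f'activated output: {output:.4f}')
print(f'loss before: {loss(weights):.6f}, loss after: {loss(new_weights):.6f}')
```

After this single step, the loss computed with new_weights is smaller than the loss computed with the original weights, which is the behavior gradient descent aims for on every iteration.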


15.5 TENSORS 


Deep learning frameworks generally manipulate data in the form of tensors. A “tensor” is 
basically a multidimensional array. Frameworks like TensorFlow pack all your data into one 
or more tensors, which they use to perform the mathematical calculations that enable neural 
networks to learn. These tensors can become quite large as the number of dimensions 
increases and as the richness of the data increases (for example, images, audios and videos 
are richer than text). Chollet discusses the types of tensors typically encountered in deep 


learning:


Chollet, Francois. Deep Learning with Python. Section 2.2. Shelter Island, NY: Manning
Publications, 2018.


• 0D (0-dimensional) tensor—This is one value and is known as a scalar.


• 1D tensor—This is similar to a one-dimensional array and is known as a vector. A 1D
tensor might represent a sequence, such as hourly temperature readings from a sensor or
the words of one movie review.


• 2D tensor—This is similar to a two-dimensional array and is known as a matrix. A 2D
tensor could represent a grayscale image in which the tensor’s two dimensions are the
image’s width and height in pixels, and the value in each element is the intensity of that
pixel.


• 3D tensor—This is similar to a three-dimensional array and could be used to represent a
color image. The first two dimensions would represent the width and height of the image
in pixels and the depth at each location might represent the red, green and blue (RGB)
components of a given pixel’s color. A 3D tensor also could represent a collection of 2D
tensors containing grayscale images.


• 4D tensor—A 4D tensor could be used to represent a collection of color images in 3D
tensors. It also could be used to represent one video. Each frame in a video is essentially a
color image.


• 5D tensor—This could be used to represent a collection of 4D tensors containing videos.


A tensor’s shape typically is represented as a tuple of values in which the number of elements 
specifies the tensor’s number of dimensions and each value in the tuple specifies the size of 


the tensor’s corresponding dimension. 
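

To make the dimension terminology concrete, the following sketch builds small NumPy arrays with 0 through 4 dimensions and displays each one's ndim and shape attributes. The array contents and sizes are arbitrary placeholders chosen for illustration.

```python
import numpy as np

scalar = np.array(7.0)                  # 0D tensor: one value
vector = np.array([20.5, 21.0, 19.8])   # 1D tensor: three sensor readings
matrix = np.zeros((28, 28))             # 2D tensor: one grayscale image
color_image = np.zeros((28, 28, 3))     # 3D tensor: one RGB image
image_batch = np.zeros((10, 28, 28, 3)) # 4D tensor: ten RGB images

tensors = [('scalar', scalar), ('vector', vector), ('matrix', matrix),
           ('color_image', color_image), ('image_batch', image_batch)]

for name, tensor in tensors:
    # the number of elements in shape equals the number of dimensions
    print(f'{name}: ndim={tensor.ndim}, shape={tensor.shape}')
```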


Let’s assume we're creating a deep-learning network to identify and track objects in 4K
(high-resolution) videos that have 30 frames per second. Each frame in a 4K video is
3840-by-2160 pixels. Let’s also assume the pixels are presented as red, green and blue
components of a color. So each frame would be a 3D tensor containing a total of 24,883,200
elements (3840 * 2160 * 3) and each video would be a 4D tensor containing the sequence of
frames. If the videos are one minute long, you'd have 44,789,760,000 elements per tensor!
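

The element counts in this example can be checked with ordinary Python integer arithmetic. The variable names below are ours, chosen only for illustration.

```python
width, height, channels = 3840, 2160, 3   # one 4K frame with RGB components
frames_per_second = 30
seconds = 60                               # a one-minute video

frame_shape = (width, height, channels)                   # 3D tensor
video_shape = (frames_per_second * seconds,) + frame_shape  # 4D tensor

frame_elements = width * height * channels
video_elements = frames_per_second * seconds * frame_elements

print(f'frame shape {frame_shape}: {frame_elements:,} elements')
print(f'video shape {video_shape}: {video_elements:,} elements')
```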


Over 600 hours of video are uploaded to YouTube every minute so, in just one minute of
uploads, Google could have a tensor containing 1,612,431,360,000,000 elements to use in 
training deep-learning models—that’s big data. As you can see, tensors can quickly become 
enormous, so manipulating them efficiently is crucial. This is one of the key reasons that 
most deep learning is performed on GPUs. More recently Google created TPUs (Tensor 
Processing Units) that are specifically designed to perform tensor manipulations, executing 
faster than GPUs. 


https://www.inc.com/tom-popomaronis/youtube-analyzed-trillions-of-data-points-in-2018-revealing-5-eye-opening-behavioral-statistics.html.


High-Performance Processors 


Powerful processors are needed for real-world deep learning because the size of tensors can 
be enormous and large-tensor operations can place crushing demands on processors. The 


processors most commonly used for deep learning are: 


• NVIDIA GPUs (Graphics Processing Units)—Originally developed by companies like
NVIDIA for computer gaming, GPUs are much faster than conventional CPUs for
processing large amounts of data, thus enabling developers to train, validate and test
deep-learning models more efficiently—and thus experiment with more of them. GPUs
are optimized for the mathematical matrix operations typically performed on tensors, an
essential aspect of how deep learning works “under the hood.” NVIDIA’s Volta Tensor
Cores are specifically designed for deep learning. Many NVIDIA GPUs are compatible
with TensorFlow, and hence Keras, and can enhance the performance of your
deep-learning models.


https://www.nvidia.com/en-us/data-center/tensorcore/.


https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/.


https://www.tensorflow.org/install/gpu.


• Google TPUs (Tensor Processing Units)—Recognizing that deep learning is crucial to its
future, Google developed TPUs, which it now uses in its Cloud TPU service. That service
“can provide up to 11.5 petaflops of performance in a single pod” (that’s 11.5 quadrillion
floating-point operations per second). Also, TPUs are designed to be especially energy
efficient. This is a key concern for companies like Google with already massive computing
clusters that are growing exponentially and consuming vast amounts of energy.


https://cloud.google.com/tpu/.


15.6 CONVOLUTIONAL NEURAL NETWORKS FOR VISION; 
MULTI-CLASSIFICATION WITH THE MNIST DATASET 


In the “Machine Learning” chapter, we classified handwritten digits using the 8-by-8-pixel, 
low-resolution images from the Digits dataset bundled with Scikit-learn. That dataset is 
based on a subset of the higher-resolution MNIST handwritten digits dataset. Here, we'll use 
MNIST to explore deep learning with a convolutional neural network (also called a
convnet or CNN). Convnets are common in computer-vision applications, such as 
recognizing handwritten digits and characters, and recognizing objects in images and video. 
They’re also used in non-vision applications, such as natural-language processing and 


recommender systems.


https://en.wikipedia.org/wiki/Convolutional_neural_network.


The Digits dataset has only 1797 samples, whereas MNIST has 70,000 labeled digit image 
samples—60,000 for training and 10,000 for testing. Each sample is a grayscale 28-by-28 
pixel image (784 total features) represented as a NumPy array. Each pixel is a value from 0 to 
255 representing the intensity (or shade) of that pixel—the Digits dataset uses less granular 
shading with values from 0 to 16. MNIST’s labels are integer values in the range 0 through 9,


indicating the digit each image represents. 


The machine-learning model you used in the previous chapter produced as its output a digit 
image’s predicted class—an integer in the range 0-9. The convnet model we'll build will 
perform probabilistic classification. For each digit image, the model will output an
array of 10 probabilities, each indicating the likelihood that the digit belongs to a particular
one of the classes 0 through 9. The class with the highest probability is the predicted value.
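

Selecting the predicted class from an array of probabilities can be sketched as follows. The 10 probabilities here are invented for illustration; a trained model would produce its own values.

```python
import numpy as np

# Hypothetical model output for one digit image: one probability per class 0-9
probabilities = np.array([0.01, 0.02, 0.05, 0.02, 0.10,
                          0.03, 0.02, 0.70, 0.03, 0.02])

# The index of the largest probability is the predicted digit
predicted_class = int(np.argmax(probabilities))

print(f'predicted digit: {predicted_class}')
print(f'probability: {probabilities[predicted_class]:.2f}')
```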


https://en.wikipedia.org/wiki/Probabilistic_classification.





Reproducibility in Keras and Deep Learning 


We've discussed the importance of reproducibility throughout the book. In deep learning, 


reproducibility is more difficult because the libraries heavily parallelize operations that 


perform floating-point calculations. Each time operations execute, they may execute in a 
different order. This can produce differences in your results. Getting reproducible results in 
Keras requires a combination of environment settings and code settings that are described in 
the Keras FAQ: 


https://keras.io/getting-started/faq/#how-can-i-obtain-reproducible-results-usi





Basic Keras Neural Network


A Keras neural network consists of the following components: 


• A network (also called a model)—A sequence of layers containing the neurons used to
learn from the samples. Each layer’s neurons receive inputs, process them (via an
activation function) and produce outputs. The data is fed into the network via an input
layer that specifies the dimensions of the sample data. This is followed by hidden
layers of neurons that implement the learning and an output layer that produces the
predictions. The more layers you stack, the deeper the network is, hence the term deep
learning.


• A loss function—This produces a measure of how well the network predicts the target
values. Lower loss values indicate better predictions.


• An optimizer—This attempts to minimize the values produced by the loss function to
tune the network to make better predictions.


Launch JupyterLab 


This section assumes that you’ve activated the tf_env Anaconda environment you created in 

Section 15.3 and launched JupyterLab from the ch15 examples folder. You can either open
the MNIST_CNN.ipynb file in JupyterLab and execute the code in the cells we provided, or
you can create a new notebook and enter the code on your own. If you prefer, you can work at
the command line in IPython; however, placing your code in a Jupyter Notebook makes it
significantly easier for you to re-execute this chapter’s examples.


As a reminder, you can reset a Jupyter Notebook and remove its outputs by selecting 
Restart Kernel and Clear All Outputs from JupyterLab’s Kernel menu. This terminates 
the notebook’s execution and removes its outputs. You might do this if your model is not 
performing well and you want to try different hyperparameters or possibly restructure your 
neural network. You can then re-execute the notebook one cell at a time or execute the
entire notebook by selecting Run All from JupyterLab’s Run menu.


We found that we sometimes had to execute this menu option twice to clear the outputs.


15.6.1 Loading the MNIST Dataset 


Let’s import the tensorflow.keras.datasets.mnist module so we can load the 


dataset: 


[1]: from tensorflow.keras.datasets import mnist


Note that because we’re using the version of Keras built into TensorFlow, the Keras module 
names begin with "tensorflow.". In the standalone Keras version, the module names
begin with "keras.", so keras.datasets would be used above. Keras uses TensorFlow to


execute the deep-learning models. 
The mnist module’s load_data function loads the MNIST training and testing sets: 


[2]: (X_train, y_train), (X_test, y_test) = mnist.load_data()


When you call load_data it will download the MNIST data to your system. The function 
returns a tuple of two elements containing the training and testing sets. Each element is itself 


a tuple containing the samples and labels, respectively. 


15.6.2 Data Exploration 


Let’s get to know the data before working with it. First, we check the dimensions of the 
training set images (X_train), training set labels (y_train), testing set images (X_test)
and testing set labels (y_test):


[3]: X_train.shape
[3]: (60000, 28, 28)

[4]: y_train.shape
[4]: (60000,)

[5]: X_test.shape
[5]: (10000, 28, 28)

[6]: y_test.shape
[6]: (10000,)








You can see from X_train’s and X_test’s shapes that the images are higher resolution than 
those in Scikit-learn’s Digits dataset (which are 8-by-8).


Visualizing Digits


Let’s visualize some of the digit images. First, enable Matplotlib in the notebook, import 
Matplotlib and Seaborn and set the font scale: 


[7]: %matplotlib inline

[8]: import matplotlib.pyplot as plt

[9]: import seaborn as sns

[10]: sns.set(font_scale=2)


The IPython magic 


%matplotlib inline


indicates that Matplotlib-based graphics should be displayed in the notebook rather than in
separate windows. For more IPython magics you can use in Jupyter Notebooks, see:


https://ipython.readthedocs.io/en/stable/interactive/magics.html


Next, we'll display a randomly selected set of 24 MNIST training set images. Recall from the 

“Array-Oriented Programming with NumPy” chapter that you can pass a sequence of indexes 
as a NumPy array’s subscript to select only the array elements at those indexes. We’ll use that 
capability here to select the elements at the same indexes in both the X_train and y_train


arrays. This ensures that we display the correct label for each randomly selected image. 
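

Here is a small sketch of that indexing technique on two tiny, made-up arrays, so you can see that one index array keeps the samples and labels paired. The array contents are placeholders, not MNIST data.

```python
import numpy as np

# Hypothetical stand-ins for the image and label arrays
samples = np.array(['img_a', 'img_b', 'img_c', 'img_d', 'img_e'])
labels = np.array([3, 1, 4, 1, 5])

# Randomly choose three distinct positions from 0 through len(samples) - 1
index = np.random.choice(np.arange(len(samples)), 3, replace=False)

# Using the same index array as a subscript for both arrays keeps them aligned
selected_samples = samples[index]
selected_labels = labels[index]

for sample, label in zip(selected_samples, selected_labels):
    print(sample, label)
```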


NumPy’s choice function (from the numpy.random module) randomly selects the
number of elements specified in its second argument (24) from the array of values in its first
argument (in this case, an array containing X_train’s range of indices). The function returns
an array containing the selected values, which we store in index. The expressions
X_train[index] and y_train[index] use index to get the corresponding elements
from both arrays. The rest of this cell is the visualization code from the previous chapter’s 


Digits case study: 


[11]: import numpy as np
      index = np.random.choice(np.arange(len(X_train)), 24, replace=False)
      figure, axes = plt.subplots(nrows=4, ncols=6, figsize=(16, 9))

      for item in zip(axes.ravel(), X_train[index], y_train[index]):
          axes, image, target = item
          axes.imshow(image, cmap=plt.cm.gray_r)
          axes.set_xticks([])  # remove x-axis tick marks
          axes.set_yticks([])  # remove y-axis tick marks
          axes.set_title(target)
      plt.tight_layout()


You can see in the output below that MNIST’s digit images have higher resolution than those 


in Scikit-learn’s Digits dataset. 


[Figure: 24 randomly selected MNIST training images in a 4-by-6 grid, each titled with its label.]


Looking at the digits, you can see why handwritten digit recognition is a challenge:


• Some people write “open” 4s (like the ones in the first and third rows), and some write
“closed” 4s (like the one in the second row). Though each 4 has some similar features,
they’re all different from one another.


• The 3 in the second row looks strange—more like a merged 6 and 7. Compare this to the
much clearer 3 in the fourth row.


• The 5 in the second row could easily be confused with a 6.


• Also, people write their digits at different angles, as you can see with the four 6s in the
third and fourth rows—two are upright, one leans left and one leans right.


If you run the preceding snippet multiple times, you can see additional randomly selected 
digits. You'll probably find that—if not for the labels displayed above each digit—it would be
difficult for you to identify some of the digits. We’ll soon see how accurately our first convnet 
will predict the digits in the MNIST test set. 


If you do run the cell multiple times, the snippet number next to the cell will increment
each time, as it does in IPython at the command line.


15.6.3 Data Preparation 


Recall from the “Machine Learning” chapter that Scikit-learn’s bundled datasets were 
preprocessed into the shapes its models required. In real-world studies, you'll generally have 
to do some or all of the data preparation. The MNIST dataset requires some preparation for 


use in a Keras convnet. 


Reshaping the Image Data 


Keras convnets require NumPy array inputs in which each sample has the shape: 


(width, height, channels) 


For MNIST, each image’s width and height are 28 pixels, and each pixel has one channel (the 


grayscale shade of the pixel from 0 to 255), so each sample’s shape will be:


(28, 28, 1)


Full-color images with RGB (red/green/blue) values for each pixel would have three
channels—one channel each for the red, green and blue components of a color.


As the neural network learns from the images, it creates many more channels. Rather than 
shade or color, the learned channels will represent more complex features, like edges, curves 
and lines, that will eventually enable the network to recognize digits based on these 


additional features and how they’re combined. 


Let’s reshape the 60,000 training and 10,000 testing set images into the correct dimensions 
for use in our convnet and confirm their new shapes. Recall that NumPy array method 


reshape receives a tuple representing the array’s new shape: 


[12]: X_train = X_train.reshape((60000, 28, 28, 1))

[13]: X_train.shape
[13]: (60000, 28, 28, 1)

[14]: X_test = X_test.reshape((10000, 28, 28, 1))

[15]: X_test.shape
[15]: (10000, 28, 28, 1)








Normalizing the Image Data 


Numeric features in data samples may have value ranges that vary widely. Deep learning 
networks perform better on data that is scaled either into the range 0.0 to 1.0, or to a range 
for which the data’s mean is 0.0 and its standard deviation is 1.0. Getting your data into one


of these forms is known as normalization. 
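

Both forms of scaling can be sketched on a tiny, made-up array of pixel intensities. The array values and variable names are assumptions for illustration only.

```python
import numpy as np

pixels = np.array([0, 64, 128, 255], dtype=np.uint8)  # hypothetical intensities

# Form 1: scale into the range 0.0 to 1.0 by dividing by the maximum value
scaled = pixels.astype('float32') / 255

# Form 2: shift and scale so the mean is 0.0 and the standard deviation is 1.0
as_float = pixels.astype('float32')
standardized = (as_float - as_float.mean()) / as_float.std()

print('scaled:', scaled)
print('standardized mean and std:', standardized.mean(), standardized.std())
```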


S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift. https://arxiv.org/abs/1502.03167.


In MNIST, each pixel is an integer in the range 0-255. The following statements convert the 
values to 32-bit (4-byte) floating-point numbers using the NumPy array method astype,
then divide every element in the resulting array by 255, producing normalized values in the 


range 0.0-1.0: 


[16]: X_train = X_train.astype('float32') / 255

[17]: X_test = X_test.astype('float32') / 255


One-Hot Encoding: Converting the Labels From Integers to Categorical Data 


As we mentioned, the convnet’s prediction for each digit will be an array of 10 probabilities, 
indicating the likelihood that the digit belongs to a particular one of the classes 0 through 9.
When we evaluate the model’s accuracy, Keras compares the model’s predictions to the
labels. To do that, Keras requires both to have the same shape. The MNIST label for each
digit, however, is one integer value in the range 0-9. So, we must transform the labels into
categorical data—that is, arrays of categories that match the format of the predictions. To
do this, we'll use a process called one-hot encoding, which converts data into arrays of
1.0s and 0.0s in which only one element is 1.0 and the rest are 0.0s. For MNIST, the one-hot- 
encoded values will be 10-element arrays representing the categories 0 through 9. One-hot 


encoding also can be applied to other types of data. 
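

One way to picture one-hot encoding is the following NumPy sketch. The one_hot function is a hypothetical helper we wrote for illustration; it is not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a 2D float32 array with a 1.0 at each label's index."""
    encoded = np.zeros((len(labels), num_classes), dtype='float32')
    # for row i, set the column given by labels[i] to 1.0
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

labels = np.array([7, 3, 5])   # three made-up digit labels
encoded = one_hot(labels, num_classes=10)

print(encoded)
# np.argmax along each row recovers the original integer labels
print(np.argmax(encoded, axis=1))
```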


This term comes from certain digital circuits in which a group of bits is allowed to have only
one bit turned on (that is, to have the value 1). https://en.wikipedia.org/wiki/One-hot.


We know precisely which category each digit belongs to, so the categorical representation of a 
digit label will consist of a 1.0 at that digit’s index and 0.0s for all the other elements (again, 


Keras uses floating-point numbers internally). So, a 7’s categorical representation is: 


[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]


and a 3’s representation is:


[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]


The tensorflow.keras.utils module provides function to_categorical to perform 
one-hot encoding. The function counts the unique categories then, for each item being 
encoded, creates an array of that length with a 1.0 in the correct position. Let’s transform 
y_train and y_test from one-dimensional arrays containing the values 0-9 into
two-dimensional arrays of categorical data. After doing so, the rows of these arrays will look like
those shown above. Snippet [21] outputs one sample’s categorical data for the digit 5 (recall 


that NumPy shows the decimal point, but not trailing 0s on floating-point values): 


[18]: from tensorflow.keras.utils import to_categorical

[19]: y_train = to_categorical(y_train)

[20]: y_train.shape
[20]: (60000, 10)

[21]: y_train[0]
[21]: array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0.], dtype=float32)

[22]: y_test = to_categorical(y_test)

[23]: y_test.shape
[23]: (10000, 10)























15.6.4 Creating the Neural Network


Now that we’ve prepared the data, we'll configure a convolutional neural network. We begin 


with the Keras Sequential model from the tensorflow.keras.models module: 


[24]: from tensorflow.keras.models import Sequential

[25]: cnn = Sequential()


The resulting network will execute its layers sequentially—the output of one layer becomes 
the input to the next. This is known as a feed-forward network. As you'll see when we 


discuss recurrent neural networks, not all neural networks operate this way.


Adding Layers to the Network 


A typical convolutional neural network consists of several layers—an input layer that receives 
the training samples, hidden layers that learn from the samples and an output layer that 
produces the prediction probabilities. We'll create a basic convnet here. Let’s import from the 


tensorflow.keras.layers module the layer classes we'll use in this example: 


[26]: from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D


We discuss each below.


Convolution 


We'll begin our network with a convolution layer, which uses the relationships between 
pixels that are close to one another to learn useful features (or patterns) in small areas of 


each sample. These features become inputs to subsequent layers. 


The small areas that convolution learns from are called kernels or patches. Let’s examine 
convolution on a 6-by-6 image. Consider the following diagram in which the 3-by-3 shaded 
square represents the kernel—the numbers are simply position numbers showing the order in 


which the kernels are visited and processed: 


[Figure: a 6-by-6 input to the convolutional layer with the 3-by-3 kernel shaded at position 1, and the layer's 4-by-4 output.]




You can think of the kernel as a “sliding window” that the convolution layer moves one pixel 
at a time left-to-right across the image. When the kernel reaches the right edge, the 
convolution layer moves the kernel one pixel down and repeats this left-to-right process. 
Kernels typically are 3-by-3, though we found convnets that used 5-by-5 and 7-by-7 for
higher-resolution images. Kernel size is a tunable hyperparameter.


https://www.quora.com/How-can-I-decide-the-kernel-size-output-maps-and-layers-of-CNN.


Initially, the kernel is in the upper-left corner of the original image—kernel position 1 (the 
shaded square) in the input layer above. The convolution layer performs mathematical 
calculations using those nine features to “learn” about them, then outputs one new feature to 
position 1 in the layer’s output. By looking at features near one another, the network begins to 


recognize features like edges, straight lines and curves. 


Next, the convolution layer moves the kernel one pixel to the right (known as the stride) to 
position 2 in the input layer. This new position overlaps with two of the three columns in the 
previous position, so that the convolution layer can learn from all the features that touch one 
another. The layer learns from the nine features in kernel position 2 and outputs one new 
feature in position 2 of the output, as in: 


[Figure: the same 6-by-6 input with the 3-by-3 kernel shaded at position 2, and the layer's 4-by-4 output.]


For a 6-by-6 image and a 3-by-3 kernel, the convolution layer does this two more times to 
produce features for positions 3 and 4 of the layer’s output. Then, the convolution layer 
moves the kernel one pixel down and begins the left-to-right process again for the next four 
kernel positions, producing outputs in positions 5-8, then 9-12 and finally 13-16. The
complete pass of the image left-to-right and top-to-bottom is called a filter. For a 3-by-3 
kernel, the filter dimensions (4-by-4 in our sample above) will be two less than the input 
dimensions (6-by-6). For each 28-by-28 MNIST image, the filter will be 26-by-26. 
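

The sliding-window pass described above can be sketched with two nested loops. The slide_kernel function and the averaging kernel are hypothetical, for illustration only; a real convolution layer learns its kernel values during training and is implemented far more efficiently.

```python
import numpy as np

def slide_kernel(image, kernel):
    """Slide a square kernel over a 2D image, one pixel at a time, no padding."""
    k = kernel.shape[0]
    out_rows = image.shape[0] - k + 1   # for a 3-by-3 kernel: two less than the input
    out_cols = image.shape[1] - k + 1
    output = np.zeros((out_rows, out_cols))

    for row in range(out_rows):         # move top-to-bottom
        for col in range(out_cols):     # move left-to-right
            patch = image[row:row + k, col:col + k]
            # combine the patch's features into one new output feature
            output[row, col] = np.sum(patch * kernel)

    return output

image = np.arange(36, dtype=float).reshape(6, 6)  # made-up 6-by-6 "image"
kernel = np.ones((3, 3)) / 9                      # made-up averaging kernel
result = slide_kernel(image, kernel)

print(result.shape)
print(slide_kernel(np.zeros((28, 28)), kernel).shape)
```

The first shape displayed corresponds to the 6-by-6 example and the second to an MNIST-sized 28-by-28 input.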


The number of filters in the convolutional layer is commonly 32 or 64 when processing small 
images like those in MNIST, and each filter produces different results. The number of filters 
depends on the image dimensions—higher-resolution images have more features, so they 
require more filters. If you study the code the Keras team used to produce their pretrained 
convnets, you'll find that they used 64, 128 or even 256 filters in their first convolutional
layers. Based on their convnets and the fact that the MNIST images are small, we'll use 64 
filters in our first convolutional layer. The set of filters produced by a convolution layer is 


called a feature map. 


https://github.com/keras-team/keras-applications/tree/master/keras_applications.


Subsequent convolution layers combine features from previous feature maps to recognize 
larger features and so on. If we were doing facial recognition, early layers might recognize 
lines, edges and curves, and subsequent layers might begin combining those into larger 
features like eyes, eyebrows, noses, ears and mouths. Once the network learns a feature, 
because of convolution, it can recognize that feature anywhere in the image. This is one of the 


reasons that convnets are used for object recognition in images. 


Adding a Convolution Layer 


Let’s add a Conv2D convolution layer to our model: 


[27]: cnn.add(Conv2D(filters=64, kernel_size=(3, 3), activation='relu',
                     input_shape=(28, 28, 1)))


The Conv2D layer is configured with the following arguments:


• filters=64—The number of filters in the resulting feature map.


• kernel_size=(3, 3)—The size of the kernel used in each filter.


• activation='relu'—The 'relu' (Rectified Linear Unit) activation function is
used to produce this layer’s output. 'relu' is the most widely used activation function in
today’s deep learning networks and is good for performance because it’s easy to
calculate. It’s commonly recommended for convolutional layers.


3Chollet, Francois. Deep Learning with Python. p. 72. Shelter Island, NY: Manning 


Publications, 2018. 


4 ttps://towardsdatascience.com/exploring-activation-functions-for- 


eural-networks-73498da59b02. 


5 ttps://www.quora.com/How-should-I-choose-a-proper-activation- 


unction-for-the-neural-network. 


Because this is the first layer in the model, we also pass the input_shape=(28, 28, 1)
argument to specify the shape of each sample. This automatically creates an input layer to
load the samples and pass them into the Conv2D layer, which is actually the first hidden
layer. In Keras, each subsequent layer infers its input shape from the previous layer’s
output shape, making it easy to stack layers.


Dimensionality of the First Convolution Layer’s Output 


In the preceding convolutional layer, the input samples are 28-by-28-by-1—that is, 784 
features each. We specified 64 filters and a 3-by-3 kernel size for the layer, so the output for 
each image is 26-by-26-by-64 for a total of 43,264 features in the feature map—a significant 
increase in dimensionality and an enormous number compared to the numbers of features 
we processed in the “Machine Learning” chapter’s models. As each layer adds more features, 
the resulting feature maps’ dimensionality becomes significantly larger. This is one of the
reasons that deep learning studies often require tremendous processing power.


Overfitting 


Recall from the previous chapter that overfitting can occur when your model is too complex
compared to what it is modeling. In the most extreme case, a model memorizes its training
data. When you make predictions with an overfit model, they will be accurate if new data
matches the training data, but the model could perform poorly with data it has never seen.


Overfitting tends to occur in deep learning as the dimensionality of the layers becomes too
large. 6, 7 This causes the network to learn specific features of the training-set digit images,
rather than learning the general features of digit images. Some techniques to prevent
overfitting include training for fewer epochs, data augmentation, dropout and L1 or L2
regularization. 8, 9, 10 We’ll discuss dropout later in the chapter.


6 https://cs231n.github.io/convolutional-networks/.


7 https://medium.com/@cxu24/why-dimensionality-reduction-is-important-dd60b5611543.


8 https://towardsdatascience.com/preventing-deep-neural-network-from-overfitting-953458db800a.


9 https://towardsdatascience.com/deep-learning-3-more-on-cnns-handling-overfitting-2bd5d99abe5d.


10 https://www.kdnuggets.com/2015/04/preventing-overfitting-neural-networks.html.


Higher dimensionality also increases (and sometimes explodes) computation time. If you’re
performing the deep learning on CPUs rather than GPUs or TPUs, the training could become
intolerably slow.


Adding a Pooling Layer 


To reduce overfitting and computation time, a convolution layer is often followed by one or 
more layers that reduce the dimensionality of the convolution layer’s output. A pooling 
layer compresses (or down-samples) the results by discarding features, which helps make 
the model more general. The most common pooling technique is called max pooling, which 
examines a 2-by-2 square of features and keeps only the maximum feature. To understand 
pooling, let’s once again assume a 6-by-6 set of features. In the following diagram, the 
numeric values in the 6-by-6 square represent the features that we wish to compress and the
2-by-2 blue square in position 1 represents the initial pool of features to examine:


[Figure: 6-by-6 set of features before 2-by-2 max pooling is applied]


The max pooling layer first looks at the pool in position 1 above, then outputs the maximum
feature from that pool (9 in our diagram). Unlike convolution, there’s no overlap between
pools. The pool moves by its width, so for a 2-by-2 pool, the stride is 2. For the second pool,
represented by the orange 2-by-2 square, the layer outputs 7. For the third pool, the layer
outputs 9. Once the pool reaches the right edge, the pooling layer moves the pool down by its
height (2 rows), then continues from left to right. Because every group of four features is
reduced to one, 2-by-2 pooling compresses the number of features by 75%.
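To make the pooling procedure concrete, here is a minimal pure-Python sketch of non-overlapping 2-by-2 max pooling. The function max_pool_2d and the feature values are hypothetical and chosen only for illustration (they are not the values from the book's diagram), and this is not how Keras implements MaxPooling2D internally:

```python
def max_pool_2d(features, pool=2):
    """Apply non-overlapping pool-by-pool max pooling to a 2D list."""
    rows, cols = len(features), len(features[0])
    pooled = []
    # Step by the pool size so pools never overlap (stride == pool size).
    for r in range(0, rows - pool + 1, pool):
        pooled_row = []
        for c in range(0, cols - pool + 1, pool):
            window = [features[r + i][c + j]
                      for i in range(pool) for j in range(pool)]
            pooled_row.append(max(window))  # keep only the largest feature
        pooled.append(pooled_row)
    return pooled


# Illustrative 6-by-6 feature values (made up for this sketch).
features = [
    [2, 9, 1, 7, 3, 9],
    [4, 5, 6, 0, 8, 2],
    [1, 3, 7, 2, 4, 6],
    [0, 8, 5, 9, 1, 3],
    [6, 2, 4, 4, 7, 5],
    [3, 1, 0, 8, 2, 6],
]

pooled = max_pool_2d(features)
for row in pooled:
    print(row)
```

The 36 input features become a 3-by-3 result with 9 features, which is the 75% reduction described above.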


Let’s add a MaxPooling2D layer to our model:


[28]: cnn.add(MaxPooling2D(pool_size=(2, 2)))


This reduces the previous layer’s output from 26-by-26-by-64 to 13-by-13-by-64. 11


11 Another technique for reducing overfitting is to add Dropout layers.


Though pooling is a common technique to reduce overfitting, some research suggests that
additional convolutional layers which use larger strides for their kernels can reduce
dimensionality and overfitting without discarding features. 12


12 Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox and Martin Riedmiller. Striving for
Simplicity: The All Convolutional Net. April 13, 2015.
https://arxiv.org/abs/1412.6806.


Adding Another Convolutional Layer and Pooling Layer 


Convnets often have many convolution and pooling layers. The Keras team’s convnets tend to
double the number of filters in subsequent convolutional layers to enable the model to learn
more relationships between the features. 13 So, let’s add a second convolution layer with 128
filters, followed by a second pooling layer to once again reduce the dimensionality by 75%:


13 https://github.com/keras-team/keras-applications/tree/master/keras_applications.


[29]: cnn.add(Conv2D(filters=128, kernel_size=(3, 3), activation='relu'))


[30]: cnn.add(MaxPooling2D(pool_size=(2, 2)))



















The input to the second convolution layer is the 13-by-13-by-64 output of the first pooling
layer. So, the output of snippet [29] will be 11-by-11-by-128. For odd dimensions like
11-by-11, Keras pooling layers round down by default (in this case to 10-by-10), so this pooling
layer’s output will be 5-by-5-by-128.
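The shapes quoted so far can be traced with a few lines of arithmetic. This is a sketch under the assumptions stated in the text (3-by-3 kernels with no padding, 2-by-2 pools that round odd dimensions down); conv_shape and pool_shape are hypothetical helper names, not Keras functions:

```python
def conv_shape(shape, filters, kernel=3):
    """Shape after a convolution with no padding and a stride of 1."""
    height, width, _ = shape
    return (height - kernel + 1, width - kernel + 1, filters)


def pool_shape(shape, pool=2):
    """Shape after non-overlapping pooling; odd sizes round down."""
    height, width, channels = shape
    return (height // pool, width // pool, channels)


shapes = [(28, 28, 1)]                                # one MNIST sample
shapes.append(conv_shape(shapes[-1], filters=64))     # first Conv2D
shapes.append(pool_shape(shapes[-1]))                 # first MaxPooling2D
shapes.append(conv_shape(shapes[-1], filters=128))    # second Conv2D
shapes.append(pool_shape(shapes[-1]))                 # second MaxPooling2D

for shape in shapes:
    height, width, channels = shape
    print(shape, 'features:', height * width * channels)

# A Flatten layer would turn the last shape into this many values.
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
```

Note how the first convolution raises the feature count from 784 to 43,264, and each pooling step then cuts it back down.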


Flattening the Results 


At this point, the previous layer’s output is three-dimensional (5-by-5-by-128), but the final 
output of our model will be a one-dimensional array of 10 probabilities that classify the 
digits. To prepare for the one-dimensional final predictions, we first need to flatten the 
previous layer’s three-dimensional output. A Keras Flatten layer reshapes its input to one 
dimension. In this case, the Flatten layer’s output will be 1-by-3200 (that is, 5 * 5 *
128):


[31]: cnn.add(Flatten())


Adding a Dense Layer to Reduce the Number of Features 


The layers before the Flatten layer learned digit features. Now we need to take all those 
features and learn the relationships among them so our model can classify which digit each 
image represents. Learning the relationships among features and performing classification is 
accomplished with fully connected Dense layers, like those shown in the neural network 
diagram earlier in the chapter. The following Dense layer creates 128 neurons (units) that
learn from the 3200 outputs of the previous layer:


[32]: cnn.add(Dense(units=128, activation='relu'))


Many convnets contain at least one Dense layer like the one above. Convnets geared to more
complex image datasets with higher-resolution images like ImageNet (a dataset of over 14
million images 14) often have several Dense layers, commonly with 4096 neurons. You can
see such configurations in several of Keras’s pretrained ImageNet convnets; 15 we list these
in Section 15.11.


14 http://www.image-net.org.


15 https://github.com/keras-team/keras-applications/tree/master/keras_applications.


Adding Another Dense Layer to Produce the Final Output 


Our final layer is a Dense layer that classifies the inputs into neurons representing the
classes 0 through 9. The softmax activation function converts the values of these
remaining 10 neurons into classification probabilities. The neuron that produces the highest
probability represents the prediction for a given digit image:


[33]: cnn.add(Dense(units=10, activation='softmax'))
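To illustrate what softmax does to the final 10 neurons' values, here is a minimal standard-library sketch. The raw output values are invented for illustration, and the softmax function below is a textbook formulation, not the Keras implementation:

```python
import math


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(value - largest) for value in values]
    total = sum(exps)
    return [exp_value / total for exp_value in exps]


# Hypothetical raw outputs of the 10 final neurons for one image.
raw_outputs = [1.2, 0.3, -0.8, 2.5, 0.0, -1.1, 0.7, 4.9, 0.1, 1.6]

probabilities = softmax(raw_outputs)
predicted_digit = probabilities.index(max(probabilities))

for digit, probability in enumerate(probabilities):
    print(f'{digit}: {probability:.4%}')
print('prediction:', predicted_digit)
```

The neuron with the largest raw value (index 7 here) also receives the largest probability, so it becomes the prediction.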


Printing the Model’s Summary 


A model’s summary method shows you the model’s layers. Some interesting things to note
are the output shapes of the various layers and the number of parameters. The parameters
are the weights that the network learns during training. 16, 17 This is a relatively small network,
yet it will need to learn nearly 500,000 parameters! And this is for tiny images that have less
than one quarter of the resolution of the icons on most smartphone home screens. Imagine
how many features a network would have to learn to process high-resolution 4K video frames
or the super-high-resolution images produced by today’s digital cameras. In the Output
Shape, None simply means that the model does not know in advance how many training
samples you're going to provide; this is known only when you start the training.


16 https://hackernoon.com/everything-you-need-to-know-about-neural-networks-8988c3ee4491.


17 https://www.kdnuggets.com/2018/06/deep-learning-best-practices-weight-initialization.html.





[34]: cnn.summary()

Layer (type)                    Output Shape            Param #
conv2d_1 (Conv2D)               (None, 26, 26, 64)      640
max_pooling2d_1 (MaxPooling2    (None, 13, 13, 64)      0
conv2d_2 (Conv2D)               (None, 11, 11, 128)     73856
max_pooling2d_2 (MaxPooling2    (None, 5, 5, 128)       0
flatten_1 (Flatten)             (None, 3200)            0
dense_1 (Dense)                 (None, 128)             409728
dense_2 (Dense)                 (None, 10)              1290

Total params: 485,514
Trainable params: 485,514
Non-trainable params: 0
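The total of 485,514 parameters can be reproduced by hand. The sketch below assumes the usual counting rule: each filter or neuron has one weight per input value it sees plus one bias. The helper names conv2d_params and dense_params are ours, for illustration only:

```python
def conv2d_params(kernel_height, kernel_width, input_channels, filters):
    """Weights plus one bias for each filter in a convolution layer."""
    return (kernel_height * kernel_width * input_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias for each neuron in a fully connected layer."""
    return (inputs + 1) * units


counts = {
    'conv2d_1': conv2d_params(3, 3, 1, 64),     # 3-by-3 kernel on 1 channel
    'conv2d_2': conv2d_params(3, 3, 64, 128),   # 3-by-3 kernel on 64 channels
    'dense_1': dense_params(3200, 128),         # 3200 flattened inputs
    'dense_2': dense_params(128, 10),           # 128 inputs, 10 classes
}
total = sum(counts.values())

for layer, count in counts.items():
    print(f'{layer}: {count:,}')
print(f'total: {total:,}')
```

Pooling and Flatten layers contribute nothing because they have no weights to learn. Most of the parameters sit in the first Dense layer, which connects all 3200 flattened features to 128 neurons.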





Also, note that there are no “non-trainable” parameters. By default, Keras trains all
parameters, but it is possible to prevent training for specific layers, which is typically done
when you’re tuning your networks or using another model’s learned parameters in a new
model (a process called transfer learning). 18


18 https://keras.io/getting-started/faq/#how-can-i-freeze-keras-layers.


Visualizing a Model’s Structure 


You can visualize the model summary using the plot_model function from the module
tensorflow.keras.utils:


[35]: from tensorflow.keras.utils import plot_model
      from IPython.display import Image
      plot_model(cnn, to_file='convnet.png', show_shapes=True,
                 show_layer_names=True)
      Image(filename='convnet.png')


After storing the visualization in convnet.png, we use module IPython.display’s Image
class to show the image in the notebook. Keras assigns the layer names in the image: 19


19 The node with the large integer value 112430057960 at the top of the diagram appears to be
a bug in the current version of Keras. This node represents the input layer and should say
InputLayer.


[Figure: plot_model diagram of the convnet. Each node shows a layer's input and output shapes:
conv2d_1 (None, 28, 28, 1) to (None, 26, 26, 64); max_pooling2d_1 to (None, 13, 13, 64);
conv2d_2 to (None, 11, 11, 128); max_pooling2d_2 to (None, 5, 5, 128); flatten_1 to
(None, 3200); dense_1 to (None, 128); dense_2 to (None, 10)]


Compiling the Model


Once you've added all the layers, you complete the model by calling its compile method:


[36]: cnn.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])


The arguments are:


• optimizer='adam': The optimizer this model will use to adjust the weights
throughout the neural network as it learns. There are many optimizers; 20 'adam'
performs well across a wide variety of models. 21, 22


20 For more Keras optimizers, see https://keras.io/optimizers/.


21 https://medium.com/octavian-ai/which-optimizer-and-learning-rate-should-i-use-for-deep-learning-5acb418f9b2.


22 https://towardsdatascience.com/types-of-optimization-algorithms-used-in-neural-networks-and-ways-to-optimize-gradient-95ae5d39529f.


• loss='categorical_crossentropy': This is the loss function used by the optimizer
in multi-classification networks like our convnet, which will predict 10 classes. As the
neural network learns, the optimizer attempts to minimize the values returned by the loss
function. The lower the loss, the better the neural network is at predicting what each
image is. For binary classification (which we’ll use later in this chapter), Keras provides
'binary_crossentropy', and for regression, 'mean_squared_error'. For other
loss functions, see https://keras.io/losses/.


• metrics=['accuracy']: This is a list of the metrics that the network will produce to
help you evaluate the model. Accuracy is a commonly used metric in classification
models. In this example, we'll use the accuracy metric to check the percentage of correct
predictions. For a list of other metrics, see https://keras.io/metrics/.
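To see why a lower loss means better predictions, consider this minimal sketch of categorical cross-entropy for a single sample. It uses the standard textbook formula (the negative log of the probability assigned to the correct class); the probability lists are invented, and Keras's own implementation additionally averages over a batch and guards against log(0):

```python
import math


def categorical_crossentropy(one_hot_label, probabilities):
    """Negative log of the probability given to the correct class."""
    return -sum(label * math.log(probability)
                for label, probability in zip(one_hot_label, probabilities)
                if label > 0)


# One-hot label for the digit 7.
label = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

# Two hypothetical predictions: one confident and correct, one unsure.
confident = [0.01] * 7 + [0.91] + [0.01] * 2
unsure = [0.05] * 7 + [0.55] + [0.05] * 2

confident_loss = categorical_crossentropy(label, confident)
unsure_loss = categorical_crossentropy(label, unsure)
print(f'confident: {confident_loss:.4f}, unsure: {unsure_loss:.4f}')
```

Both predictions pick the right digit, so they count the same toward accuracy, but the confident one has the lower loss. That is the quantity the optimizer tries to reduce.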


15.6.5 Training and Evaluating the Model 


Similar to Scikit-learn’s models, we train a Keras model by calling its fit method:


• As in Scikit-learn, the first two arguments are the training data and the categorical target
labels.


• epochs specifies the number of times the model should process the entire set of training
data. As we mentioned earlier, neural networks are trained iteratively.


• batch_size specifies the number of samples to process at a time during each epoch.
Most models specify a power of 2 from 32 to 512. Larger batch sizes can decrease model
accuracy. 23 We chose 64. You can try different values to see how they affect the model’s
performance.


23 Keskar, Nitish Shirish, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy and
Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap
and Sharp Minima. CoRR abs/1609.04836 (2016).
https://arxiv.org/abs/1609.04836.


• In general, some samples should be used to validate the model. If you specify validation
data, after each epoch, the model will use it to make predictions and display the
validation loss and accuracy. You can study these values to tune your layers and the fit
method’s hyperparameters, or possibly change the layer composition of your model. Here,
we used the validation_split argument to indicate that the model should reserve
the last 10% (0.1) of the training samples for validation; 24 in this case, 6000 samples
will be used for validation. If you have separate validation data, you can use the
validation_data argument (as you’ll see in Section 15.9) to specify a tuple containing
arrays of samples and target labels. In general, it’s better to get randomly selected
validation data. You can use scikit-learn’s train_test_split function for this purpose
(as we'll do later in this chapter), then pass the randomly selected data with the
validation_data argument.


24 https://keras.io/getting-started/faq/#how-is-the-validation-split-computed.
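The sample counts implied by these arguments are easy to work out. The sketch below assumes the 60,000-sample MNIST training set used in this chapter, and assumes the validation samples are simply the last 10% of the array, as described above. The function split_and_batches is a hypothetical helper that only mirrors the arithmetic; it does not call Keras:

```python
import math


def split_and_batches(total_samples, validation_split, batch_size):
    """Return (training count, validation count, batches per epoch)."""
    validation_count = int(total_samples * validation_split)
    training_count = total_samples - validation_count
    # The final batch of an epoch may be smaller than batch_size.
    batches_per_epoch = math.ceil(training_count / batch_size)
    return training_count, validation_count, batches_per_epoch


training, validation, batches = split_and_batches(60000, 0.1, 64)
print(f'train on {training}, validate on {validation}')
print(f'{batches} batches per epoch')

# Indexes of the samples held out for validation (the last 10%).
validation_indexes = range(training, training + validation)
print(validation_indexes[0], validation_indexes[-1])
```

With batch_size=64, each epoch processes the 54,000 training samples in 844 batches, the last of which holds the 48 leftover samples.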


In the following output, we highlighted the training accuracy (acc) and validation accuracy
(val_acc) in bold:


[37]: cnn.fit(X_train, y_train, epochs=5, batch_size=64,
              validation_split=0.1)
Train on 54000 samples, validate on 6000 samples
Epoch 1/5
54000/54000 - 68s 1ms/step
Epoch 2/5
54000/54000 - 64s 1ms/step
Epoch 3/5
54000/54000 - 69s 1ms/step
Epoch 4/5
54000/54000 - 70s 1ms/step
Epoch 5/5
54000/54000 - 63s 1ms/step
[The per-epoch loss, acc, val_loss and val_acc values are illegible in the source.]

[37]: <tensorflow.python.keras.callbacks.History at 0x7f105ba0ada0>











In Section 15.7, we'll introduce TensorBoard, a TensorFlow tool for visualizing data from
your deep-learning models. In particular, we'll view charts showing how the training and
validation accuracy and loss values change through the epochs. In Section 15.8, we'll
demonstrate Andrej Karpathy’s ConvnetJS tool, which trains convnets in your web browser
and dynamically visualizes the layers’ outputs, including what each convolutional layer “sees”
as it learns. You can also run his MNIST and CIFAR10 models. These will help you better
understand neural networks’ complex operations.


As the training proceeds, the fit method outputs information showing you the progress of
each epoch, how long the epoch took to execute (in this case, each took 63-70 seconds), and
the evaluation metrics for that pass. During the last epoch of this model, the accuracy reached
99.48% for the training samples (acc) and 99.27% for the validation samples (val_acc).
Those are impressive numbers, given that we have not yet tried to tune the hyperparameters
or tweak the number and types of the layers, which could lead to even better (or worse)
results. Like machine learning, deep learning is an empirical science that benefits from lots of
experimentation.


Evaluating the Model 


Now we can check the accuracy of the model on data the model has not yet seen. To do so, we
call the model’s evaluate method, which displays as its output how long it took to
process the test samples (about four seconds in total, or 366 microseconds per sample, in this case):


[38]: loss, accuracy = cnn.evaluate(X_test, y_test)
10000/10000 [==============================] - 4s 366us/step


[39]: loss
[39]: 0.026809450998473768


[40]: accuracy
[40]: 0.9917








According to the preceding output, our convnet model is 99.17% accurate when predicting the 
labels for unseen data—and, at this point, we have not tried to tune the model. With a little 
online research, you can find models that can predict MNIST with nearly 100% accuracy. Try 
experimenting with different numbers of layers, types of layers and layer parameters and
observe how those changes affect your results.


Making Predictions 


The model’s predict method predicts the classes of the digit images in its argument array
(X_test):


[41]: predictions = cnn.predict(X_test)


We can check what the first sample digit should be by looking at y_test[0]:


[42]: y_test[0]
[42]: array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0.], dtype=float32)


According to this output, the first sample is the digit 7, because the categorical representation
of the test sample’s label specifies a 1.0 at index 7. Recall that we created this representation
via one-hot encoding.


Let’s check the probabilities returned by the predict method for the first test sample:


[43]: for index, probability in enumerate(predictions[0]):
          print(f'{index}: {probability:.10%}')
0: 0.0000000201%
1: 0.0000001355%
2: 0.0000186951%
3: 0.0000015494%
4: 0.0000000003%
5: 0.0000000012%
6: 0.0000000000%
7: 99.9999761581%
8: 0.0000005577%
9: 0.0000011416%


According to the output, predictions[0] indicates that our model believes this digit is a 7
with nearly 100% certainty. Not all predictions have this level of certainty.


Locating the Incorrect Predictions 


Next, we’d like to view some of the incorrectly predicted images to get a sense of the ones our
model has trouble with. For example, if it’s always mispredicting 8s, perhaps we need more
8s in our training data.


Before we can view incorrect predictions, we need to locate them. Consider
predictions[0] above. To determine whether the prediction was correct, we must
compare the index of the largest probability in predictions[0] to the index of the element
containing 1.0 in y_test[0]. If these index values are the same, then the prediction was
correct; otherwise, it was incorrect. NumPy’s argmax function determines the index of the
highest valued element in its array argument. Let’s use that to locate the incorrect
predictions. In the following snippet, p is the predicted value array, and e is the expected
value array (the expected values are the labels for the dataset’s test images):


[44]: images = X_test.reshape((10000, 28, 28))
      incorrect_predictions = []

      for i, (p, e) in enumerate(zip(predictions, y_test)):
          predicted, expected = np.argmax(p), np.argmax(e)

          if predicted != expected:
              incorrect_predictions.append(
                  (i, images[i], predicted, expected))


In this snippet, we first reshape the samples from the shape (28, 28, 1) that Keras
required for learning back to (28, 28), which Matplotlib requires to display the images.
Next, we populate the list incorrect_predictions using the for statement. We zip the
rows that represent each sample in the arrays predictions and y_test, then enumerate
those so we can capture their indexes. If the argmax results for p and e are different, then
the prediction was incorrect, and we append a tuple to incorrect_predictions
containing that sample’s index, image, the predicted value and the expected value. We
can confirm the total number of incorrect predictions (out of 10,000 images in the test set)
with:


[45]: len(incorrect_predictions)
[45]: 83


Visualizing Incorrect Predictions 


The following snippet displays 24 of the incorrect images labeled with each image’s index,
predicted value (p) and expected value (e):


[46]: figure, axes = plt.subplots(nrows=4, ncols=6, figsize=(16, 12))

      for axes, item in zip(axes.ravel(), incorrect_predictions):
          index, image, predicted, expected = item
          axes.imshow(image, cmap=plt.cm.gray_r)
          axes.set_xticks([])  # remove x-axis tick marks
          axes.set_yticks([])  # remove y-axis tick marks
          axes.set_title(
              f'index: {index}\np: {predicted}; e: {expected}')
      plt.tight_layout()


Before reading the expected values, look at each digit and write down what digit you think it
is. This is an important part of getting to know your data:


[Figure: 4-by-6 grid of 24 incorrectly predicted digit images, each titled with its index,
predicted value (p) and expected value (e). Indexes shown, row by row: 18, 340, 460, 495, 583,
619; 625, 659, 720, 924, 947, 1014; 1062, 1182, 1226, 1232, 1260, 1319; 1393, 1414, 1522,
1530, 1611, 1621]


Displaying the Probabilities for Several Incorrect Predictions


Let’s look at the probabilities of some incorrect predictions. The following function displays
the probabilities for the specified prediction array:


[47]: def display_probabilities(prediction):
          for index, probability in enumerate(prediction):
              print(f'{index}: {probability:.10%}')


Though the 8 (at index 495) in the first line of the image output looks like an 8, our model
had trouble with it. As you can see in the following output, the model predicted this image as
a 0, but also thought there was a 16% chance it was a 6 and a 23% chance it was an 8:


[48]: display_probabilities(predictions[495])
0: 59.7235262394%
1: 0.0000015465%
2: 0.8047289215%
3: 0.0001740813%
4: 0.0016636326%
5: 0.0030567855%
6: 16.1390662193%
7: 0.0000001781%
8: 23.3022540808%
9: 0.0255270657%


The 2 (at index 583) in the first row was predicted to be a 7 with 62.7% certainty, but the
model also thought there was a 36.4% chance it was a 2:


[49]: display_probabilities(predictions[583])
0: 0.0000003016%
1: 0.0000005715%
2: 36.4056706429%
3: 0.0176281916%
4: 0.0000561930%
5: 0.0000000003%
6: 0.0000000019%
7: 62.7455413342%
8: 0.8308162513%
9: 0.0000114385%


The 6 (at index 625) at the beginning of the second row was predicted to be a 4, though that
was far from certain. In this case, the probability of a 4 (51.6%) was only slightly higher than
the probability of a 6 (48.38%):


[50]: display_probabilities(predictions[625])
0: 0.0008245181%
1: 0.0000041209%
2: 0.0012774357%
3: 0.0000000009%
4: 51.6223073006%
5: 0.0000001779%
6: 48.3754962683%
7: 0.0000000085%
8: 0.0000048182%
9: 0.0000785786%


15.6.6 Saving and Loading a Model 


Neural network models can require significant training time. Once you’ve designed and
tested a model that suits your needs, you can save its state. This allows you to load it later to
make more predictions. Sometimes models are loaded and further trained for new problems.
For example, layers in our model already know how to recognize features such as lines and
curves, which could be useful in handwritten character recognition (as in the EMNIST
dataset) as well. So you could potentially load the existing model and use it as the basis for a
more robust model. This process is called transfer learning: 25, 26 you transfer an existing
model’s knowledge into a new model. A Keras model’s save method stores the model’s
architecture and state information in a format called Hierarchical Data Format (HDF5).
Such files use the .h5 file extension by default:


[51]: cnn.save('mnist_cnn.h5')


25 https://towardsdatascience.com/transfer-learning-from-pre-trained-models-f2393f124751.


26 https://medium.com/nanonets/nanonets-how-to-use-deep-learning-when-you-have-limited-data-f68c0b512cab.


You can load a saved model with the load_model function from the
tensorflow.keras.models module:


from tensorflow.keras.models import load_model
cnn = load_model('mnist_cnn.h5')


You can then invoke its methods. For example, if you’ve acquired more data, you could call 
predict to make additional predictions on new data, or you could call fit to start training 
with the additional data. 


Keras provides several additional functions that enable you to save and load various aspects
of your models. For more information, see
https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model


15.7 VISUALIZING NEURAL NETWORK TRAINING WITH 
TENSORBOARD 


With deep learning networks, there’s so much complexity and so much going on internally
that’s hidden from you that it’s difficult to know and fully understand all the details. This
creates challenges in testing, debugging and updating models and algorithms. Deep learning
learns the features but there may be enormous numbers of them, and they may not be
apparent to you.


Google provides the TensorBoard 27, 28 tool for visualizing neural networks implemented in
TensorFlow and Keras. Just as a car’s dashboard visualizes data from your car’s sensors, such
as your speed, engine temperature and the amount of gas remaining, a TensorBoard
dashboard visualizes data from a deep learning model that can give you insights into how
well your model is learning and potentially help you tune its hyperparameters. Here, we'll
introduce TensorBoard.


27 https://github.com/tensorflow/tensorboard/blob/master/README.md.


28 https://www.tensorflow.org/guide/summaries_and_tensorboard.


Executing TensorBoard 


TensorBoard monitors a folder on your system looking for files containing the data it will 
visualize in a web browser. Here, you'll create that folder, execute the TensorBoard server, 
then access it via a web browser. Perform the following steps: 

1. Change to the ch15 folder in your Terminal, shell or Anaconda Command Prompt. 

2. Ensure that your custom Anaconda environment tf_env is activated: 


conda activate tf_env 


3. Execute the following command to create a subfolder named logs in which your
deep-learning models will write the information that TensorBoard will visualize:


mkdir logs


4. Execute TensorBoard:


tensorboard --logdir=logs 


5. You can now access TensorBoard in your web browser at 


http://localhost:6006 


If you connect to TensorBoard before executing any models, it will initially display a page
indicating “No dashboards are active for the current data set.” 29


29 TensorBoard does not currently work with Microsoft’s Edge browser.


The TensorBoard Dashboard 


TensorBoard monitors the folder you specified looking for files output by the model during
training. When TensorBoard sees updates, it loads the data into the dashboard:


[Figure: TensorBoard dashboard showing the SCALARS tab]


You can view the data as you train or after training completes. The dashboard above shows
the TensorBoard SCALARS tab, which displays charts for individual values that change over
time, such as the training accuracy (acc) and training loss (loss) shown in the first row, and
the validation accuracy (val_acc) and validation loss (val_loss) shown in the second
row. The diagrams visualize a 10-epoch run of our MNIST convnet, which we provided in the
notebook MNIST_CNN_TensorBoard.ipynb. The epochs are displayed along the x-axes
starting from 0 for the first epoch. The accuracy and loss values are displayed on the y-axes.
Looking at the training and validation accuracies, you can see in the first 5 epochs similar
results to the five-epoch run in the previous section.


For the 10-epoch run, the training accuracy continued to improve through the 9th epoch, 
then decreased slightly. This might be the point at which we're starting to overfit, but we 
might need to train longer to find out. For the validation accuracy, you can see that it jumped 
up quickly, then was relatively flat for five epochs before jumping up then decreasing. For the 
training loss, you can see that it drops quickly, then continuously declines through the ninth 
epoch, before a slight increase. The validation loss dropped quickly then bounced around. We 
could run this model for more epochs to see whether results improve, but based on these 
diagrams, it appears that around the sixth epoch we get a nice combination of training and
validation accuracy with minimal validation loss.


Normally these diagrams are stacked vertically in the dashboard. We used the search field
(above the diagrams) to show any that had the name “mnist” in their folder name; we'll
configure that in a moment. TensorBoard can load data from multiple models at once and
you can choose which to visualize. This makes it easy to compare several different models or
multiple runs of the same model.


Copy the MNIST Convnet’s Notebook 


To create the new notebook for this example: 


1. Right-click the MNIST_CNN.ipynb notebook in JupyterLab’s File Browser tab and
select Duplicate to make a copy of the notebook.


2. Right-click the new notebook named MNIST_CNN-Copy1.ipynb, then select Rename,
enter the name MNIST_CNN_TensorBoard.ipynb and press Enter.


3. Open the notebook by double-clicking its name.


Configuring Keras to Write the TensorBoard Log Files 


To use TensorBoard, before you fit the model, you need to configure a TensorBoard
object (module tensorflow.keras.callbacks), which the model will use to write data
into a specified folder that TensorBoard monitors. This object is known as a callback in
Keras. In the notebook, click to the left of the snippet that calls the model's fit method, then
type a, which is the shortcut for adding a new code cell above the current cell (use b for
below). In the new cell, enter the following code to create the TensorBoard object:


from tensorflow.keras.callbacks import TensorBoard
import time

tensorboard_callback = TensorBoard(log_dir=f'./logs/mnist{time.time()}',
    histogram_freq=1, write_graph=True)


The arguments are: 


• log_dir: The name of the folder in which this model's log files will be written. The
notation './logs/' indicates that we're creating a new folder within the logs folder you
created previously, and we follow that with 'mnist' and the current time. This ensures
that each new execution of the notebook will have its own log folder. That will enable you
to compare multiple executions in TensorBoard.

• histogram_freq: The frequency in epochs that Keras will output to the model's log
files. In this case, we'll write data to the logs for every epoch.

• write_graph: When this is True, a graph of the model will be output. You can view the
graph in the GRAPHS tab in TensorBoard.
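The per-run folder naming described for log_dir can be tried on its own, without TensorFlow. The following is a minimal sketch; the helper name make_log_dir and its parameters are hypothetical and are not part of Keras or TensorBoard:

```python
import time
from pathlib import Path

def make_log_dir(base='./logs', prefix='mnist', timestamp=None):
    """Build a per-run log folder path from a prefix and a timestamp.

    make_log_dir is a hypothetical helper for illustration only.
    """
    if timestamp is None:
        timestamp = time.time()  # seconds since the epoch, as a float
    return Path(base) / f'{prefix}{timestamp}'

# Different timestamps give different folders, so runs do not overwrite each other
first = make_log_dir(timestamp=1.0)
second = make_log_dir(timestamp=2.0)
print(first.name, second.name)  # prints mnist1.0 mnist2.0
```

Because every run gets its own folder name, TensorBoard can list the runs separately and overlay their curves.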


Updating Our Call to fit


Finally, we need to modify the original fit method call in snippet 37. For this example, we set
the number of epochs to 10, and we added the callbacks argument, which is a list of
callback objects (you can view Keras's other callbacks at https://keras.io/callbacks/):

cnn.fit(X_train, y_train, epochs=10, batch_size=64,
    validation_split=0.1, callbacks=[tensorboard_callback])


You can now re-execute the notebook by selecting Kernel > Restart Kernel and Run All
Cells in JupyterLab. After the first epoch completes, you'll start to see data in TensorBoard.


15.8 CONVNETJS: BROWSER-BASED DEEP-LEARNING 
TRAINING AND VISUALIZATION 


In this section, we'll overview Andrej Karpathy's JavaScript-based ConvnetJS tool for
training and visualizing convolutional neural networks in your web browser (you also can
download ConvnetJS from GitHub at https://github.com/karpathy/convnetjs):

https://cs.stanford.edu/people/karpathy/convnetjs/


You can run the ConvnetJS sample convolutional neural networks or create your own. We've 


used the tool on several desktop, tablet and phone browsers. 


The ConvnetJS MNIST demo trains a convolutional neural network using the MNIST dataset
we presented in Section 15.6. The demo presents a scrollable dashboard that updates
dynamically as the model trains and contains several sections.


Training Stats 


This section contains a Pause button that enables you to stop the learning and “freeze” the 
current dashboard visualizations. Once you pause the demo, the button text changes to 
resume. Clicking the button again continues training. This section also presents training 


statistics, including the training and validation accuracy and a graph of the training loss. 


Instantiate a Network and Trainer 


In this section, you'll find the JavaScript code that creates the convolutional neural network.
The default network has similar layers to the convnet in Section 15.6. The ConvnetJS
documentation (https://cs.stanford.edu/people/karpathy/convnetjs/docs.html) shows the
supported layer types and how to configure them. You can experiment with different layer
configurations in the provided textbox and begin training an updated network by clicking
the change network button.


Network Visualization 


This key section shows one training image at a time and how the network processes that 
image through each layer. Click the Pause button to inspect all the layers’ outputs for a given 
digit to get a sense of what the network “sees” as it learns. The network’s last layer produces 
the probabilistic classifications. It shows 10 squares—9 black and 1 white, indicating the 


predicted class of the current digit image. 


Example Predictions on Test Set 


The final section shows a random selection of the test set images and the top three possible 


classes for each digit. The one with the highest probability is shown on a green bar and the 


other two are displayed on red bars. The length of each bar is a visual indication of that class’s 


probability. 


15.9 RECURRENT NEURAL NETWORKS FOR SEQUENCES; 
SENTIMENT ANALYSIS WITH THE IMDB DATASET 


In the MNIST CNN network, we focused on stacked layers that were applied sequentially.
Non-sequential models are possible, as you'll see here with recurrent neural networks. In
this section, we use Keras's bundled IMDb (the Internet Movie Database) movie reviews
dataset (Maas et al., 2011, cited below) to perform binary classification, predicting whether
a given review's sentiment is positive or negative.

Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng,
Andrew Y. and Potts, Christopher, “Learning Word Vectors for Sentiment Analysis,”
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:
Human Language Technologies, June 2011. Portland, Oregon, USA. Association for
Computational Linguistics, pp. 142-150. http://www.aclweb.org/anthology/P11-1015.


We'll use a recurrent neural network (RNN), which processes sequences of data, such as 
time series or text in sentences. The term “recurrent” comes from the fact that the neural 
network contains loops in which the output of a given layer becomes the input to that same 
layer in the next time step. In a time series, a time step is the next point in time. In a text 


sequence, a “time step” would be the next word in a sequence of words. 
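The looping idea can be illustrated with a tiny one-unit recurrence in plain Python. This is only a sketch of a simple recurrent step under assumed weights (w_in, w_rec and bias are made-up values, and simple_rnn is a hypothetical name); it is not the LSTM layer used later in this section:

```python
import math

def simple_rnn(inputs, w_in=0.5, w_rec=0.8, bias=0.0):
    """Run a one-unit recurrent layer over a sequence of numbers.

    At each time step the new state depends on the current input and
    on the state produced at the previous time step.
    """
    state = 0.0  # no memory before the first time step
    states = []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state + bias)
        states.append(state)
    return states

# The same values in a different order lead to a different final state
forward = simple_rnn([1.0, 0.0, -1.0])
backward = simple_rnn([-1.0, 0.0, 1.0])
print(forward[-1], backward[-1])
```

The final states differ because each step feeds its output back in as part of the next step's input, so the order of the sequence matters.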


The looping in RNNs enables them to learn and remember relationships among the data in
the sequence. For example, consider the following sentences we used in the “Natural
Language Processing” chapter. The sentence
    The food is not good.
clearly has negative sentiment. Similarly, the sentence
    The movie was good.
has positive sentiment, though not as positive as
    The movie was excellent!
In the first sentence, the word “good” on its own has positive sentiment. However, when
preceded by “not,” which appears earlier in the sequence, the sentiment becomes negative.
RNNs take into account the relationships among the earlier and later parts of a sequence.


In the preceding example, the words that determined sentiment were adjacent. However, 
when determining the meaning of text there can be many words to consider and an arbitrary 
number of words in between them. In this section, we'll use a Long Short-Term Memory 


(LSTM) layer, which makes the neural network recurrent and is optimized to handle 


learning from sequences like the ones we described above. 


RNNs have been used for many tasks (see the links following this list), including:

• predictive text input (displaying possible next words as you type),
• sentiment analysis,
• responding to questions with the predicted best answers from a corpus,
• inter-language translation, and
• automated closed captioning in video.

https://www.analyticsindiamag.com/overview-of-recurrent-neural-networks-and-their-applications/.
https://en.wikipedia.org/wiki/Recurrent_neural_network#Applications.
http://karpathy.github.io/2015/05/21/rnn-effectiveness/.


15.9.1 Loading the IMDb Movie Reviews Dataset 


The IMDb movie reviews dataset included with Keras contains 25,000 training samples and 
25,000 testing samples, each labeled with its positive (1) or negative (0) sentiment. Let’s 


import the tensorflow.keras.datasets.imdb module so we can load the dataset: 


[1]: from tensorflow.keras.datasets import imdb


The imdb module’s load_data function returns the IMDb training and testing sets. There 
are over 88,000 unique words in the dataset. The load_data function enables you to specify 
the number of unique words to import as part of the training and testing data. In this case, we 
loaded only the top 10,000 most frequently occurring words due to the memory limitations of 
our system and the fact that we’re (intentionally) training on a CPU rather than a GPU 
(because most of our readers will not have access to systems with GPUs and TPUs). The more 


data you load, the longer training will take, but more data may help produce better models: 


[2]: number_of_words = 10000

[3]: (X_train, y_train), (X_test, y_test) = imdb.load_data(
         num_words=number_of_words)


The load_data function returns a tuple of two elements containing the training and testing 
sets. Each element is itself a tuple containing the samples and labels, respectively. In a given 
review, load_data replaces any words outside the top 10,000 with a placeholder value, 


which we'll discuss shortly. 


15.9.2 Data Exploration


Let's check the dimensions of the training set samples (X_train), training set labels
(y_train), testing set samples (X_test) and testing set labels (y_test):

[4]: X_train.shape
[4]: (25000,)

[5]: y_train.shape
[5]: (25000,)

[6]: X_test.shape
[6]: (25000,)

[7]: y_test.shape
[7]: (25000,)

The arrays y_train and y_test are one-dimensional arrays containing 1s and 0s,
indicating whether each review is positive or negative. Based on the preceding outputs,
X_train and X_test also appear to be one-dimensional. However, their elements actually
are lists of integers, each representing one review's contents, as shown in snippet [9].

(Here we used the %pprint magic to turn off pretty printing so the following snippet's
output could be displayed horizontally rather than vertically to save space. You can turn
pretty printing back on by re-executing the %pprint magic.)


[8]: %pprint
Pretty printing has been turned OFF

[9]: X_train[123]
[9]: [...]  (a list of integers, one per word in the review)

Keras deep learning models require numeric data, so the Keras team preprocessed the IMDb
dataset for you.


Movie Review Encodings 


Because the movie reviews are numerically encoded, to view their original text, you need to 
know the word to which each number corresponds. Keras’s IMDb dataset provides a 
dictionary that maps the words to their indexes. Each word’s corresponding value is its 
frequency ranking among all the words in the entire set of reviews. So the word with the 
ranking 1 is the most frequently occurring word (calculated by the Keras team from the 


dataset), the word with ranking 2 is the second most frequently occurring word, and so on. 


Though the dictionary values begin with 1 as the most frequently occurring word, in each 
encoded review (like X_train[123] shown previously), the ranking values are offset by 3. 
So any review containing the most frequently occurring word will have the value 4 wherever 


that word appears in the review. Keras reserves the values 0, 1 and 2 in each encoded review 


for the following purposes: 


• The value 0 in a review represents padding. Keras deep learning algorithms expect all the
training samples to have the same dimensions, so some reviews may need to be expanded
to a given length and some shortened to that length. Reviews that need to be expanded
are padded with 0s.

• The value 1 represents a token that Keras uses internally to indicate the start of a text
sequence for learning purposes.

• The value 2 in a review represents an unknown word, typically a word that was not
loaded because you called load_data with the num_words argument. In this case, any
review that contained words with frequency rankings greater than num_words would
have those words' numeric values replaced with 2. This is all handled by Keras when you
load the data.


Because each review’s numeric values are offset by 3, we’ll have to account for this when we 


decode the review. 
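To make the offset and the reserved values concrete, here is a minimal sketch that encodes a toy review the way the preceding list describes. The function encode_review and the five-word toy_ranks dictionary are hypothetical; Keras performs this encoding for you, and its exact cut-off rule for num_words may differ slightly from this simplified version:

```python
START, UNKNOWN, OFFSET = 1, 2, 3  # reserved values described in the text

def encode_review(words, word_to_rank, num_words):
    """Encode words as offset frequency rankings (1 = most frequent word)."""
    encoded = [START]  # every encoded review begins with the start token
    for word in words:
        rank = word_to_rank.get(word)
        if rank is None or rank > num_words:
            encoded.append(UNKNOWN)        # word outside the loaded vocabulary
        else:
            encoded.append(rank + OFFSET)  # shift past the reserved values
    return encoded

# Toy vocabulary, made up for illustration
toy_ranks = {'the': 1, 'movie': 2, 'was': 3, 'good': 4, 'sublime': 5}
print(encode_review(['the', 'movie', 'was', 'sublime'], toy_ranks, num_words=4))
# prints [1, 4, 5, 6, 2]
```

The most frequent word ('the', ranking 1) becomes 4, and 'sublime' falls outside the four loaded words, so it becomes 2.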


Decoding a Movie Review 


Let’s decode a review. First, get the word-to-index dictionary by calling the function 


get_word_index from the tensorflow.keras.datasets.imdb module: 


[10]: word_to_index = imdb.get_word_index()


The word 'great' might appear in a positive movie review, so let’s see whether it’s in the 


dictionary: 


[11]: word_to_index['great']
[11]: 84


According to the output, 'great' is the dataset's 84th most frequent word. If you look up a
word that's not in the dictionary, you'll get a KeyError exception.

To transform the frequency rankings into words, let's first reverse the word_to_index
dictionary's mapping, so we can look up every word by its frequency ranking. The following
dictionary comprehension reverses the mapping:


[12]: index_to_word = \
          {index: word for (word, index) in word_to_index.items()}


Recall that a dictionary’s items method enables us to iterate through tuples of key—value 


pairs. We unpack each tuple into the variables word and index, then create an entry in the 


new dictionary with the expression index: word. 


The following list comprehension gets the top 50 words from the new dictionary—recall that 


the most frequent word has the value 1: 


[13]: [index_to_word[i] for i in range(1, 51)]
[13]: [...]  (the 50 most frequently occurring words)

Note that most of these are stop words. Depending on the application, you might want to
remove or keep the stop words. For example, if you were creating a predictive-text
application that suggests the next word in a sentence the user is typing, you'd want to keep
the stop words so they can be displayed as predictions.


Now, we can decode a review. We use the index_to_word dictionary's two-argument
method get rather than the [] operator to get the value for each key. If a key is not in the
dictionary, the get method returns its second argument, rather than raising an exception.
The argument i - 3 accounts for the offset in the encoded reviews of each review's
frequency rankings. When the Keras reserved values 0-2 appear in a review, get returns '?';
otherwise, get returns the word with the key i - 3 in the index_to_word dictionary:


[14]: ' '.join([index_to_word.get(i - 3, '?') for i in X_train[123]])
[14]: '? beautiful and touching movie rich colors great settings good
acting and one of the most charming movies i have seen in a while i
never saw such an interesting setting when i was in china my wife
liked it so much she asked me to ? on and rate it so other would
enjoy too'

We can see from the y_train array that this review is classified as positive:

[15]: y_train[123]
[15]: 1


15.9.3 Data Preparation 


The number of words per review varies, but Keras requires all samples to have the same
dimensions. So, we need to perform some data preparation. In this case, we need to restrict
every review to the same number of words. Some reviews will need to be padded with
additional data and others will need to be truncated. The pad_sequences utility function
(module tensorflow.keras.preprocessing.sequence) reshapes X_train's samples
(that is, its rows) to the number of features specified by the maxlen argument (200) and
returns a two-dimensional array:


[16]: words_per_review = 200

[17]: from tensorflow.keras.preprocessing.sequence import pad_sequences

[18]: X_train = pad_sequences(X_train, maxlen=words_per_review)


If a sample has more features, pad_sequences truncates it to the specified length (by
default, it drops values from the beginning of the sequence). If a sample has fewer features,
pad_sequences adds 0s to the beginning of the sequence to pad it to the specified length.
Let's confirm X_train's new shape:

[19]: X_train.shape
[19]: (25000, 200)
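The padding and truncating behavior can be sketched for a single sequence in plain Python. The function pad_or_truncate below is a hypothetical stand-in that mimics what pad_sequences does with its default settings (pad at the front, truncate at the front); it is not the Keras implementation:

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Return a list of exactly maxlen items.

    Short sequences get `value` added at the front; long sequences
    keep only their last maxlen items.
    """
    extra = len(sequence) - maxlen
    if extra >= 0:
        return list(sequence[extra:])       # drop the earliest items
    return [value] * (-extra) + list(sequence)  # pad at the front

print(pad_or_truncate([1, 14, 22], maxlen=5))
# prints [0, 0, 1, 14, 22]
print(pad_or_truncate([1, 14, 22, 16, 43, 530], maxlen=5))
# prints [14, 22, 16, 43, 530]
```

Applying this to every review yields rows of equal length, which is why the resulting array is two-dimensional.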


We also must reshape X_test for later in this example when we evaluate the model:

[20]: X_test = pad_sequences(X_test, maxlen=words_per_review)

[21]: X_test.shape
[21]: (25000, 200)


Splitting the Test Data into Validation and Test Data 


In our convnet, we used the fit method's validation_split argument to indicate that
10% of our training data should be set aside to validate the model as it trains. For this
example, we'll manually split the 25,000 test samples into 20,000 test samples and 5,000
validation samples. We'll then pass the 5,000 validation samples to the model's fit method
via the argument validation_data. Let's use Scikit-learn's train_test_split function
from the previous chapter to split the test set:


[22]: from sklearn.model_selection import train_test_split
      X_test, X_val, y_test, y_val = train_test_split(
          X_test, y_test, random_state=11, test_size=0.20)


Let's also confirm the split by checking X_test's and X_val's shapes:

[23]: X_test.shape
[23]: (20000, 200)

[24]: X_val.shape
[24]: (5000, 200)


15.9.4 Creating the Neural Network 


Next, we'll configure the RNN. Once again, we begin with a Sequential model to which 


we'll add the layers that compose our network: 


[25]: from tensorflow.keras.models import Sequential

[26]: rnn = Sequential()


Next, let’s import the layers we'll use in this model: 


[27]: from tensorflow.keras.layers import Dense, LSTM

[28]: from tensorflow.keras.layers import Embedding


Adding an Embedding Layer 


Previously, we used one-hot encoding to convert the MNIST dataset's integer labels into
categorical data. The result for each label was a vector in which all but one element was 0.
We could do that for the index values that represent our words. However, this example
processes 10,000 unique words. That means we'd need a 10,000-by-10,000 array to
represent all the words. That's 100,000,000 elements, and almost all the array elements
would be 0. This is not an efficient way to encode the data. If we were to process all 88,000+
unique words in the dataset, we'd need an array of nearly eight billion elements!
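The arithmetic behind these sizes is easy to check. The sketch below uses two hypothetical helper functions and, for comparison, assumes a dense vector of 128 elements per word, which is the embedding size this example uses:

```python
def one_hot_elements(vocabulary_size):
    """Elements needed to hold one one-hot vector per vocabulary word."""
    return vocabulary_size * vocabulary_size

def dense_elements(vocabulary_size, vector_size):
    """Elements needed to hold one dense vector per vocabulary word."""
    return vocabulary_size * vector_size

print(f'{one_hot_elements(10_000):,}')     # prints 100,000,000
print(f'{one_hot_elements(88_000):,}')     # prints 7,744,000,000
print(f'{dense_elements(10_000, 128):,}')  # prints 1,280,000
```

A dense representation grows with the vocabulary size times the vector size, rather than with the square of the vocabulary size.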


To reduce dimensionality, RNNs that process text sequences typically begin with an 
embedding layer that encodes each word in a more compact dense-vector representation. 
The vectors produced by the embedding layer also capture the word’s context—that is, how a 
given word relates to the words around it. So the embedding layer enables the RNN to learn 


word relationships among the training data. 


There are also predefined word embeddings, such as Word2Vec and GloVe. You can 
load these into neural networks to save training time. They’re also sometimes used to add 
basic word relationships to a model when smaller amounts of training data are available. This 
can improve the model’s accuracy by allowing it to build upon previously learned word 
relationships, rather than trying to learn those relationships with insufficient amounts of 
data. 


Let’s create an Embedding layer (module tensorflow.keras.layers): 


[29]: rnn.add(Embedding(input_dim=number_of_words, output_dim=128,
                        input_length=words_per_review))


The arguments are:

• input_dim: The number of unique words.

• output_dim: The size of each word embedding. If you load pre-existing embeddings
like Word2Vec and GloVe, you must set this to match the size of the word embeddings
you load (see https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html).

• input_length=words_per_review: The number of words in each input sample.


Adding an LSTM Layer 
Next, we'll add an LSTM layer: 




[30]: rnn.add(LSTM(units=128, dropout=0.2, recurrent_dropout=0.2)) 


The arguments are:

• units: The number of neurons in the layer. The more neurons, the more the network can
remember. As a guideline, you can start with a value between the length of the sequences
you're processing (200 in this example) and the number of classes you're trying to predict
(2 in this example). See
https://towardsdatascience.com/choosing-the-right-hyperparameters-for-a-simple-lstm-using-keras-f8e9ed76f046.

• dropout: The percentage of neurons to randomly disable when processing the layer's
input and output. Like the pooling layers in our convnet, dropout is a proven
technique that reduces overfitting (see the two citations following this list). Keras
provides a Dropout layer that you can add to your models.

• recurrent_dropout: The percentage of neurons to randomly disable when the layer's
output is fed back into the layer again to allow the network to learn from what it has seen
previously.

Gal, Yarin, and Zoubin Ghahramani. A Theoretically Grounded Application of Dropout in
Recurrent Neural Networks. October 05, 2016. https://arxiv.org/abs/1512.05287.

Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
Journal of Machine Learning Research 15 (June 14, 2014): 1929-1958.
http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf.
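The idea of randomly disabling a fraction of values can be sketched without Keras. The function below is a hypothetical illustration that assumes the common "inverted dropout" convention (surviving values are scaled up so the expected total is unchanged); it is not how the LSTM layer is implemented internally:

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate` and rescale the survivors."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(11)  # seeded so repeated runs give the same mask
activations = [1.0] * 10
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)  # each item is either 0.0 or 1.25
```

Dropout is applied only while training; when making predictions, all the values are used.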


The mechanics of how the LSTM layer performs its task are beyond the scope of this book.
Chollet says: “you don't need to understand anything about the specific architecture of an
LSTM cell; as a human, it shouldn't be your job to understand it. Just keep in mind what the
LSTM cell is meant to do: allow past information to be reinjected at a later time.” (Chollet,
François. Deep Learning with Python. p. 204. Shelter Island, NY: Manning Publications, 2018.)


Adding a Dense Output Layer 


Finally, we need to take the LSTM layer's output and reduce it to one result indicating
whether a review is positive or negative, thus the value 1 for the units argument. Here we
use the 'sigmoid' activation function, which is preferred for binary classification (Chollet,
François. Deep Learning with Python. p. 114. Shelter Island, NY: Manning Publications,
2018). It reduces arbitrary values into the range 0.0-1.0, producing a probability:

[31]: rnn.add(Dense(units=1, activation='sigmoid'))
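For intuition, the sigmoid function itself is short enough to write by hand. This is a plain-Python sketch of the standard formula 1 / (1 + e^-x), written in two branches to avoid overflow for large negative inputs; it is independent of Keras:

```python
import math

def sigmoid(x):
    """Squash any real number into the range 0.0 to 1.0."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe: x is negative, so e is at most 1.0
    return e / (1.0 + e)

for value in (-4.0, 0.0, 4.0):
    print(value, round(sigmoid(value), 4))
# prints -4.0 0.018, then 0.0 0.5, then 4.0 0.982
```

Large negative inputs map close to 0.0, large positive inputs map close to 1.0, and 0 maps to exactly 0.5.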


Compiling the Model and Displaying the Summary 


Next, we compile the model. In this case, there are only two possible outputs, so we use the
binary_crossentropy loss function:

[32]: rnn.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
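To see what this loss measures, here is a minimal plain-Python sketch of the standard binary cross-entropy formula, averaged over the samples. The function name and the epsilon clipping value are assumptions made for illustration; Keras computes this loss for you:

```python
import math

def binary_crossentropy(labels, probabilities, epsilon=1e-7):
    """Average binary cross-entropy over paired labels and predictions."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, epsilon), 1.0 - epsilon)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

confident_and_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_and_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_and_right < confident_and_wrong)  # prints True
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when the model is confidently wrong.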


The following is the summary of our model. Notice that even though we have fewer layers
than our convnet, the RNN has nearly three times as many trainable parameters (the
network's weights) as the convnet and more parameters means more training time. The large
number of parameters primarily comes from the number of words in the vocabulary (we
loaded 10,000) times the number of neurons in the Embedding layer's output (128):


[33]: rnn.summary()

Layer (type)                 Output Shape              Param #
embedding_1 (Embedding)      (None, 200, 128)          1280000
lstm_1 (LSTM)                (None, 128)               131584
dense_1 (Dense)              (None, 1)                 129

Total params: 1,411,713
Trainable params: 1,411,713
Non-trainable params: 0
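The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard counting formulas for these layer types (an LSTM layer has four sets of input weights, recurrent weights and biases); the helper function names are hypothetical:

```python
def embedding_params(vocabulary_size, output_dim):
    # one output_dim-element vector per word
    return vocabulary_size * output_dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # one weight per input for each unit, plus one bias per unit
    return input_dim * units + units

counts = [embedding_params(10_000, 128), lstm_params(128, 128), dense_params(128, 1)]
print(counts, sum(counts))  # prints [1280000, 131584, 129] 1411713
```

The Embedding layer accounts for about 90% of the total, which matches the observation that most of the parameters come from the vocabulary size times the embedding size.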


15.9.5 Training and Evaluating the Model 


Let's train our model. Notice for each epoch that the model takes significantly longer to
train than our convnet did. This is due to the larger numbers of parameters (weights) our
RNN model needs to learn. We bolded the accuracy (acc) and validation accuracy
(val_acc) values for readability. These represent the percentage of training samples and the
percentage of validation data samples that the model predicts correctly.

(At the time of this writing, TensorFlow displayed a warning when we executed this
statement. This is a known TensorFlow issue and, according to the forums, you can safely
ignore the warning.)


[34]: rnn.fit(X_train, y_train, epochs=5, batch_size=32,
              validation_data=(X_val, y_val))
Train on 25000 samples, validate on 5000 samples
Epoch 1/5
25000/25000 [==============================] - 299s 12ms/step - loss: 0.6574
Epoch 2/5
25000/25000 [==============================] - 298s 12ms/step - loss: 0.4577
Epoch 3/5
25000/25000 [==============================] - 296s 12ms/step - loss: 0.3277
Epoch 4/5
25000/25000 [==============================] - 307s 12ms/step - loss: 0.2675
Epoch 5/5
25000/25000 [==============================] - 310s 12ms/step - loss: 0.2217
[34]: <tensorflow.python.keras.callbacks.History object at 0xb3ba882e8>


Finally, we can evaluate the results using the test data. Function evaluate returns the loss 


and accuracy values. In this case, the model was 85.99% accurate: 


[35]: results = rnn.evaluate(X_test, y_test)
20000/20000 [==============================] - 42s 2ms/step

[36]: results
[36]: [0.3415240607559681, 0.8599]
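The accuracy value reported here is the fraction of reviews classified correctly. As a rough sketch of that calculation, assuming the model's sigmoid outputs are turned into class predictions with a 0.5 threshold (the function name and the sample numbers below are made up for illustration):

```python
def binary_accuracy(labels, probabilities, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)

# Four made-up reviews: the third is positive but scored below the threshold
print(binary_accuracy([1, 0, 1, 1], [0.91, 0.20, 0.35, 0.77]))  # prints 0.75
```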


Note that the accuracy of this model seems low compared to our MNIST convnet’s results, 
but this is a much more difficult problem. If you search online for other IMDb sentiment- 
analysis binary-classification studies, you'll find lots of results in the high 80s. So we did 
reasonably well with our small recurrent neural network of only three layers. You might want 


to study some online models and try to produce a better model. 


15.10 TUNING DEEP LEARNING MODELS 


In Section 15.9.5, notice that both the testing accuracy (85.99%) and the validation
accuracy (87.04%) were noticeably lower than the 90.83% training accuracy.
Such disparities are usually the result of overfitting, so there is plenty of room for
improvement in our model (see the links following this paragraph). If you look at the output
of each epoch, you'll notice both the training and validation accuracy continue to increase.
Recall that training for too many epochs can lead to overfitting, but it's possible we have
not yet trained enough. Perhaps one hyperparameter tuning option for this model would be
to increase the number of epochs.

https://towardsdatascience.com/deep-learning-overfitting-[?]46bf5b35e24.
https://hackernoon.com/memorizing-is-not-learning-6-tricks-to-prevent-overfitting-in-machine-learning-820b091dc42.


Some variables that affect your models' performance include:

• having more or less data to train with
• having more or less data to test with
• having more or less data to validate with
• having more or fewer layers
• the types of layers you use
• the order of the layers


In our IMDb RNN example, some things we could tune include:

• trying different amounts of the training data (we used only the top 10,000 words),
• different numbers of words per review (we used only 200),
• different numbers of neurons in our layers,
• more layers or
• possibly loading pre-trained word vectors rather than having our Embedding layer learn
them from scratch.


The compute time required to train models multiple times is significant so, in deep learning,
you generally do not tune hyperparameters with techniques like k-fold cross-validation or
grid search (see
https://www.quora.com/Is-cross-validation-heavily-used-in-deep-learning-or-is-it-too-expensive-to-be-used).
There are various tuning techniques (see the links following this paragraph), but one
particularly promising area is automated machine learning (AutoML). For example, the
Auto-Keras library (https://autokeras.com/) is specifically geared to automatically choosing
the best configurations for your Keras models. Google's Cloud AutoML and Baidu's EZDL are
among various other automated machine learning efforts.

https://towardsdatascience.com/what-are-hyperparameters-and-how-to-tune-the-hyperparameters-in-a-deep-neural-network-d0604917584a.
https://medium.com/machine-learning-bites/deeplearning-series-deep-neural-networks-tuning-and-optimization-39250ff7786d.
https://flyyufelix.github.io/2016/10/03/fine-tuning-in-keras-part1.html and
https://flyyufelix.github.io/2016/10/08/fine-tuning-in-keras-part2.html.
https://towardsdatascience.com/a-comprehensive-guide-on-how-to-fine-tune-deep-neural-networks-using-keras-on-google-colab-free-[?]aaaaQaced8f.


15.11 CONVNET MODELS PRETRAINED ON IMAGENET


With deep learning, rather than starting fresh on every project with costly training, validating
and testing, you can use pretrained deep neural network models to:

• make new predictions,
• continue training them further with new data or
• transfer the weights learned by a model for a similar problem into a new model. This is
called transfer learning.


Keras Pretrained Convnet Models 


Keras comes bundled with the following pretrained convnet models
(https://keras.io/applications/), each pretrained on ImageNet (http://www.image-net.org),
a growing dataset of 14+ million images:

• Xception
• VGG16
• VGG19
• ResNet50
• Inception v3
• Inception-ResNet v2
• MobileNet v1
• DenseNet
• NASNet
• MobileNet v2


Reusing Pretrained Models 


ImageNet is too big for efficient training on most computers, so most people interested in 


using it start with one of the smaller pretrained models. 


You can reuse just the architecture of each model and train it with new data, or you can reuse
the pretrained weights. For a few simple examples, see:

https://keras.io/applications/


ImageNet Challenge 


In the end-of-chapter projects, you'll research and use some of these bundled models. You'll
also investigate the ImageNet Large Scale Visual Recognition Challenge for evaluating
object-detection and image-recognition models (http://www.image-net.org/challenges/LSVRC/).
This competition ran from 2010 through 2017. ImageNet now has a continuously running
challenge on the Kaggle competition site called the ImageNet Object Localization Challenge
(https://www.kaggle.com/c/imagenet-object-localization-challenge). The goal is to identify
“all objects within an image, so those images can then be classified and annotated.” ImageNet
releases the current participants' leaderboard once per quarter.


A lot of what you've seen in the machine learning and deep learning chapters is what the
Kaggle competition website is all about. There's no obvious optimal solution for many
machine learning and deep learning tasks. People's creativity is really the only limit. On
Kaggle, companies and organizations fund competitions where they encourage people
worldwide to develop better-performing solutions than they've been able to do for something
that's important to their business or organization. Sometimes companies offer prize money,
which has been as high as $1,000,000 in the famous Netflix competition. Netflix wanted to
get a 10% or better improvement in their model for determining whether people will like a
movie, based on how they rated previous ones (https://netflixprize.com/rules.html). They
used the results to help make better recommendations to members. Even if you do not win a
Kaggle competition, it's a great way to get experience working on problems of current interest.


15.12 WRAP-UP 


In Chapter 15, you peered into the future of AI. Deep Learning has captured the imagination
of the computer-science and data-science communities. This may be the most important AI
chapter in the book.


We mentioned the key deep-learning platforms, indicating that Google’s TensorFlow is the 
most widely used. We discussed why Keras, which presents a friendly interface to 


TensorFlow, has become so popular. 


We set up a custom Anaconda environment for TensorFlow, Keras and JupyterLab, then used
the environment to implement the Keras examples.


We explained what tensors are and why they’re crucial to deep learning. We discussed the 
basics of neurons and multi-layered neural networks for building Keras deep-learning 


models. We considered some popular types of layers and how to order them. 


We introduced convolutional neural networks (convnets) and indicated that they’re especially 
appropriate for computer-vision applications. We then built, trained, validated and tested a 
convnet using the MNIST database of handwritten digits for which we achieved 99.17% 
prediction accuracy. This is remarkable, given that we achieved it by working with a only a 
basic model and without doing any hyperparameter tuning. You can try more sophisticated 
models and tune the hyperparameters to try to achieve better performance. We listed a 


variety of intriguing computer vision tasks. 


We introduced TensorBoard for visualizing TensorFlow and Keras neural network training 
and validation. We also discussed ConvnetJS, a browser-based convnet training and 


visualization tool, which enables you to peek inside the training process. 


Next, we presented recurrent neural networks (RNNs) for processing sequences of data, such 
as time series or text in sentences. We used an RNN with the IMDb movie reviews dataset to 
perform binary classification, predicting whether each review’s sentiment was positive or 
negative. We also discussed tuning deep learning models and how high-performance 
hardware, like NVIDIA’s GPUs and Google’s TPUs, is making it possible for more people to 


tackle more substantial deep-learning studies. 


Given how costly and time-consuming it is to train deep-learning models, we explained the 
strategy of using pretrained models. We listed various Keras convnet image-processing 
models that were trained on the massive ImageNet dataset, and discussed how transfer 
learning enables you to use these models to create new ones quickly and effectively. Deep 


learning is a large, complex topic. We focused on the basics in the chapter. 


In the next chapter, we present the big-data infrastructure that supports the kinds of AI
technologies we’ve discussed in Chapters 12 through 15. We'll consider the Hadoop and Spark
platforms for big-data batch processing and real-time streaming applications. We'll look at
relational databases and the SQL language for querying them; these have dominated the
database field for many decades. We'll discuss how big data presents challenges that
relational databases don’t handle well, and consider how NoSQL databases are designed to
handle those challenges. We'll conclude the book with a discussion of the Internet of Things
(IoT), which will surely be the world’s largest big-data source and will present many
opportunities for entrepreneurs to develop leading-edge businesses that will truly make a
difference in people’s lives.




16. Big Data: Hadoop, Spark, NoSQL and IoT


Objectives 

In this chapter you'll: 

• Understand what big data is and how quickly it’s getting bigger.
• Manipulate a SQLite relational database using Structured Query Language (SQL).
• Understand the four major types of NoSQL databases.
• Store tweets in a MongoDB NoSQL JSON document database and visualize them on a
  Folium map.
• Understand Apache Hadoop and how it’s used in big-data batch-processing applications.
• Build a Hadoop MapReduce application on Microsoft’s Azure HDInsight cloud service.
• Understand Apache Spark and how it’s used in high-performance, real-time big-data
  applications.
• Use Spark streaming to process data in mini-batches.
• Understand the Internet of Things (IoT) and the publish/subscribe model.
• Publish messages from a simulated Internet-connected device and visualize its messages in
  a dashboard.
• Subscribe to PubNub’s live Twitter and IoT streams and visualize the data.

16.1 INTRODUCTION


In Section 1.7, we introduced big data. In this capstone chapter, we discuss popular hardware
and software infrastructure for working with big data, and we develop complete applications 
on several desktop and cloud-based big-data platforms. 


Databases 


Databases are critical big-data infrastructure for storing and manipulating the massive 
amounts of data we’re creating. They’re also critical for securely and confidentially 
maintaining that data, especially in the context of ever-stricter privacy laws such as HIPAA 
(Health Insurance Portability and Accountability Act) in the United States and 
GDPR (General Data Protection Regulation) for the European Union. 


First, we'll present relational databases, which store structured data in tables with a
fixed number of columns per row. You'll manipulate relational databases via
Structured Query Language (SQL).


Most data produced today is unstructured data, like the content of Facebook posts and
Twitter tweets, or semi-structured data like JSON and XML documents. Twitter processes
each tweet’s contents into a semi-structured JSON document with lots of metadata, as you
saw in the “Data Mining Twitter” chapter. Relational databases are not geared to the
unstructured and semi-structured data in big-data applications. So, as big data evolved, new
kinds of databases were created to handle such data efficiently. We’ll discuss the four major
types of these NoSQL databases: key-value, document, columnar and graph databases.
Also, we'll overview NewSQL databases, which blend the benefits of relational and NoSQL
databases. Many NoSQL and NewSQL vendors make it easy to get started with their products
through free tiers and free trials, and typically in cloud-based environments that require
minimal installation and setup. This makes it practical for you to gain big-data experience
before “diving in.”


Apache Hadoop 


Much of today’s data is so large that it cannot fit on one system. As big data grew, we needed
distributed data storage and parallel processing capabilities to process the data more
efficiently. This led to complex technologies like Apache Hadoop for distributed data 
processing with massive parallelism among clusters of computers where the intricate details 
are handled for you automatically and correctly. We’ll discuss Hadoop, its architecture and 
how it’s used in big-data applications. We'll guide you through configuring a multi-node 
Hadoop cluster using the Microsoft Azure HDInsight cloud service, then use it to execute a 
Hadoop MapReduce job that you'll implement in Python. Though HDInsight is not free, 
Microsoft gives you a generous new-account credit that should enable you to run the 
chapter’s code examples without incurring additional charges. 


Apache Spark 


As big-data processing needs grow, the information-technology community is continually 
looking for ways to increase performance. Hadoop executes tasks by breaking them into 
pieces that do lots of disk I/O across many computers. Spark was developed as a way to 
perform certain big-data tasks in memory for better performance. 


We'll discuss Apache Spark, its architecture and how it’s used in high-performance, real-time 
big-data applications. You'll implement a Spark application using functional-style 
filter/map/reduce programming capabilities. First, you'll build this example using a Jupyter 
Docker stack that runs locally on your desktop computer, then you’ll implement it using a 
cloud-based Microsoft Azure HDInsight multi-node Spark cluster. 
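
To make the functional-style filter/map/reduce idea concrete before we reach Spark, here is a minimal sketch in plain Python. It does not use Spark's API; the sample lines and the merge_counts helper are hypothetical and chosen only for illustration:

```python
from functools import reduce

# Hypothetical sample data: a few short lines of text.
lines = ["big data is big", "", "spark and hadoop process big data"]

# filter: drop empty lines; map: split each remaining line into words.
non_empty = filter(lambda line: line.strip(), lines)
word_lists = map(str.split, non_empty)


def merge_counts(counts, words):
    """Fold one list of words into the running word-count dictionary."""
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


# reduce: combine the per-line word lists into a single result.
word_counts = reduce(merge_counts, word_lists, {})
print(word_counts)
```

Each step consumes the previous step's output lazily, which is the same pipeline shape you'll express later with a cluster doing the work.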


We'll introduce Spark streaming for processing streaming data in mini-batches. Spark
streaming gathers data for a short time interval you specify, then gives you that batch of data
to process. You'll implement a Spark streaming application that processes tweets. In that
example, you'll use Spark SQL to query data stored in a Spark DataFrame which, unlike
pandas DataFrames, may contain data distributed over many computers in a cluster.


Internet of Things 


We'll conclude with an introduction to the Internet of Things (IoT)—billions of devices that 
are continuously producing data worldwide. We'll introduce the publish/subscribe model 
that IoT and other types of applications use to connect data users with data providers. First, 
without writing any code, you'll build a web-based dashboard using Freeboard.io and a 
sample live stream from the PubNub messaging service. Next, you'll simulate an Internet- 
connected thermostat which publishes messages to the free Dweet.io messaging service using 
the Python module Dweepy, then create a dashboard visualization of the data with 
Freeboard.io. Finally, you'll build a Python client that subscribes to a sample live stream from 
the PubNub service and dynamically visualizes the stream with Seaborn and a Matplotlib
FuncAnimation.
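
The publish/subscribe model itself can be sketched in a few lines of plain Python. The Broker class below is a hypothetical, in-memory stand-in; it is not the PubNub or Dweet.io API, and the channel names and message contents are made up for illustration:

```python
from collections import defaultdict


class Broker:
    """A toy in-memory message broker (hypothetical; not PubNub or Dweet.io)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # channel name -> callbacks

    def subscribe(self, channel, callback):
        """Register a function to call for each message on a channel."""
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        """Deliver a message to every subscriber of the channel."""
        for callback in self._subscribers[channel]:
            callback(message)


broker = Broker()
received = []
broker.subscribe("thermostat", received.append)
broker.publish("thermostat", {"Location": "Boston", "Temperature": 72})
broker.publish("doorbell", {"Pressed": True})  # no subscribers, so dropped
print(received)
```

The publisher never refers to a subscriber directly; both know only the channel name. That decoupling is what lets a real messaging service connect many devices to many data users.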


Experience Cloud and Desktop Big-Data Software 


Cloud vendors focus on service-oriented architecture (SOA) technology in which they
provide “as-a-Service” capabilities that applications connect to and use in the cloud. Common
services provided by cloud vendors include:1

1 For more as-a-Service acronyms, see
https://en.wikipedia.org/wiki/Cloud_computing and
https://en.wikipedia.org/wiki/As_a_service.

“As-a-Service” acronyms (note that several are the same)

Big data as a Service (BDaaS)         Platform as a Service (PaaS)
Hadoop as a Service (HaaS)            Software as a Service (SaaS)
Hardware as a Service (HaaS)          Storage as a Service (SaaS)
Infrastructure as a Service (IaaS)    Spark as a Service (SaaS)


You'll get hands-on experience in this chapter with several cloud-based tools. In this 
chapter’s examples, you'll use the following platforms: 


• A free MongoDB Atlas cloud-based cluster.
• A multi-node Hadoop cluster running on Microsoft’s Azure HDInsight cloud-based
  service; for this you'll use the credit that comes with a new Azure account.
• A free single-node Spark “cluster” running on your desktop computer, using a Jupyter
  Docker-stack container.
• A multi-node Spark cluster, also running on Microsoft’s Azure HDInsight; for this you'll
  continue using your Azure new-account credit.


There are many other options, including cloud-based services from Amazon Web Services, 
Google Cloud and IBM Watson, and the free desktop versions of the Hortonworks and 
Cloudera platforms (there also are cloud-based paid versions of these). You also could try a 
single-node Spark cluster running on the free cloud-based Databricks Community Edition. 
Spark’s creators founded Databricks. 


Always check the latest terms and conditions of each service you use. Some 
require you to enable credit-card billing to use their clusters. Caution: Once you 
allocate Microsoft Azure HDInsight clusters (or other vendors’ clusters), they 
incur costs. When you complete the case studies using services such as 
Microsoft Azure, be sure to delete your cluster(s) and their other resources (like 
storage). This will help extend the life of your Azure new-account credit. 


Installation and setups vary across platforms and over time. Always follow each vendor's
latest steps. If you have questions, the best sources for help are the vendor’s support
capabilities and forums. Also, check sites such as stackoverflow.com; other people may
have asked questions about similar problems and received answers from the developer
community.


Algorithms and Data 


Algorithms and data are the core of Python programming. The first few chapters of this book
were mostly about algorithms. We introduced control statements and discussed algorithm
development. Data was small: primarily individual integers, floats and strings. Chapters 5-9
emphasized structuring data into lists, tuples, dictionaries, sets, arrays and files.


Data’s Meaning 


But, what about the meaning of the data? Can we use the data to gain insights to better 
diagnose cancers? Save lives? Improve patients’ quality of life? Reduce pollution? Conserve 
water? Increase crop yields? Reduce damage from devastating storms and fires? Develop 
better treatment regimens? Create jobs? Improve company profitability? 


The data-science case studies of Chapters 11-15 all focused on AI. In this chapter, we focus on
the big-data infrastructure that supports AI solutions. As the data used with these 
technologies continues growing exponentially, we want to learn from that data and do so at 
blazing speed. We'll accomplish these goals with a combination of sophisticated algorithms, 
hardware, software and networking designs. We’ve presented various machine-learning 
technologies, seeing that there are indeed great insights to be mined from data. With more 
data, and especially with big data, machine learning can be even more effective. 


Big-Data Sources 


The following articles and sites provide links to hundreds of free big data sources: 


“Awesome-Public-Datasets,” GitHub.com,
https://github.com/caesar0301/awesome-public-datasets.

“AWS Public Datasets,” https://aws.amazon.com/public-datasets/.

“Big Data And AI: 30 Amazing (And Free) Public Data Sources For 2018,” by B. Marr,
https://www.forbes.com/sites/bernardmarr/2018/02/26/big-data-and-…mazing-and-free-public-data-sources-for-2018/.

“Datasets for Data Mining and Data Science,”
http://www.kdnuggets.com/datasets/index.html.

“Exploring Open Data Sets,” https://datascience.berkeley.edu/open-dat…ets/.

“Free Big Data Sources,” Datamics, http://datamics.com/free-big-data-so…

“Hadoop Illuminated, Chapter 16. Publicly Available Big Data Sets,”
http://hadoopilluminated.com/hadoop_illuminated/Public_Bigdata_Se…

“List of Public Data Sources Fit for Machine Learning,”
https://blog.bigml.com/list-of-public-data-sources-fit-for-machin…earning/.

“Open Data,” Wikipedia, https://en.wikipedia.org/wiki/Open_data.

“Open Data 500 Companies,” http://www.opendata500.com/us/list/.

“Other Interesting Resources/Big Data and Analytics Educational Resources and
Research,” B. Marr, http://computing.derby.ac.uk/bigdatares/?page_id=…

“6 Amazing Sources of Practice Data Sets,”
https://www.jigsawacademy.com/6-amazing-sources-of-practice-data-…

“20 Big Data Repositories You Should Check Out,” M. Krivanek,
http://www.datasciencecentral.com/profiles/blogs/20-free-big-data…ources-everyone-should-check-out.

“70+ Websites to Get Large Data Repositories for Free,”
http://bigdata-madesimple.com/70-websites-to-get-large-data-…epositories-for-free/.

“Ten Sources of Free Big Data on Internet,” A. Brown,
https://www.linkedin.com/pulse/ten-sources-free-big-data-internet…rown.

“Top 20 Open Data Sources,”
https://www.linkedin.com/pulse/top-20-open-data-sources-zygimanta…acikevicius.

“We’re Setting Data, Code and APIs Free,” NASA, https://open.nasa.gov/o…ata/.

“Where Can I Find Large Datasets Open to the Public?” Quora,
https://www.quora.com/Where-can-I-find-large-datasets-open-to-the…











16.2 RELATIONAL DATABASES AND STRUCTURED QUERY
LANGUAGE (SQL)


Databases are crucial, especially for big data. In Chapter 9, we demonstrated sequential
text-file processing, working with data from CSV files and working with JSON. These
techniques are useful when most or all of a file’s data is to be processed. On the other hand,
in transaction processing we need to locate and, possibly, update an individual data item quickly.


A database is an integrated collection of data. A database management system 
(DBMS) provides mechanisms for storing and organizing data in a manner consistent with 
the database’s format. Database management systems allow for convenient access and 
storage of data without concern for the internal representation of databases. 


Relational database management systems (RDBMSs) store data in tables and define
relationships among the tables. Structured Query Language (SQL) is used almost universally
with relational database systems to manipulate data and perform queries, which request
information that satisfies given criteria.2

2 The writing in this chapter assumes that SQL is pronounced as see-quel. Some prefer
ess que el.


Popular open-source RDBMSs include SQLite, PostgreSQL, MariaDB and MySQL. These can 
be downloaded and used freely by anyone. All have support for Python. We’ll use SQLite, 
which is bundled with Python. Some popular proprietary RDBMSs include Microsoft SQL 
Server, Oracle, Sybase and IBM Db2. 


Tables, Rows and Columns 


A relational database is a logical table-based representation of data that allows the data to be
accessed without consideration of its physical structure. The following diagram shows a
sample Employee table that might be used in a personnel system:

Number   Name       Department   Salary   Location
23603    Jones      413          1100     New Jersey
24568    Kerwin     413          2000     New Jersey
34589    Larson     642          1800     Los Angeles
35761    Myers      611          1400     Orlando
47132    Neumann    413          9000     New Jersey
78321    Stephens   611          8500     Orlando

(In the diagram, Number is labeled as the primary key, the Larson entry is labeled as a row
and Department is labeled as a column.)


The table’s primary purpose is to store employees’ attributes. Tables are composed of rows,
each describing a single entity. Here, each row represents one employee. Rows are composed
of columns containing individual attribute values. The table above has six rows. The
Number column represents the primary key, a column (or group of columns) with a value
that’s unique for each row. This guarantees that each row can be identified by its primary key.
Examples of primary keys are Social Security numbers, employee ID numbers and part
numbers in an inventory system; values in each of these are guaranteed to be unique. In this
case, the rows are listed in ascending order by primary key, but they could be listed in
descending order or no particular order at all.
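
The uniqueness guarantee is something the DBMS enforces for you. The following sketch, which assumes a throwaway in-memory SQLite database and a hypothetical second employee named Smith, shows an attempt to reuse an existing primary-key value being rejected:

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # temporary in-memory database
connection.execute("""CREATE TABLE Employee (
                          Number INTEGER PRIMARY KEY,
                          Name TEXT, Department INTEGER,
                          Salary INTEGER, Location TEXT)""")
connection.execute(
    "INSERT INTO Employee VALUES (23603, 'Jones', 413, 1100, 'New Jersey')")

try:
    # Same Number as Jones, so the DBMS should reject this row.
    connection.execute(
        "INSERT INTO Employee VALUES (23603, 'Smith', 611, 1500, 'Orlando')")
    duplicate_rejected = False
except sqlite3.IntegrityError as error:
    duplicate_rejected = True
    print("Rejected:", error)

row_count = connection.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
print("Rows in table:", row_count)
```

Only the original Jones row remains in the table after the failed insert.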


Each column represents a different data attribute. Rows are unique (by primary key) within a
table, but particular column values may be duplicated between rows. For example, three
different rows in the Employee table’s Department column contain number 413.


Selecting Data Subsets 


Different database users are often interested in different data and different relationships
among the data. Most users require only subsets of the rows and columns. Queries specify
which subsets of the data to select from a table. You use Structured Query Language (SQL) to
define queries. For example, you might select data from the Employee table to create a result
that shows where each department is located, presenting the data sorted in increasing order
by department number. This result is shown below. We'll discuss SQL shortly.

Department   Location
413          New Jersey
611          Orlando
642          Los Angeles

SQLite


The code examples in the rest of Section 16.2 use the open-source SQLite database
management system that’s included with Python, but most popular database systems have
Python support. Each typically provides a module that adheres to Python’s Database
Application Programming Interface (DB-API), which specifies common object and
method names for manipulating any database.
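
As a rough sketch of what those common DB-API names look like in practice, the following code rebuilds the sample Employee table in an in-memory SQLite database and runs the department-location query described earlier. The exact SQL is our assumption of one way to produce that result; connect, cursor, execute, executemany and fetchall are standard DB-API names provided by sqlite3:

```python
import sqlite3

employees = [
    (23603, 'Jones', 413, 1100, 'New Jersey'),
    (24568, 'Kerwin', 413, 2000, 'New Jersey'),
    (34589, 'Larson', 642, 1800, 'Los Angeles'),
    (35761, 'Myers', 611, 1400, 'Orlando'),
    (47132, 'Neumann', 413, 9000, 'New Jersey'),
    (78321, 'Stephens', 611, 8500, 'Orlando'),
]

connection = sqlite3.connect(':memory:')  # Connection object
cursor = connection.cursor()              # Cursor object for running SQL
cursor.execute("""CREATE TABLE Employee (
                      Number INTEGER PRIMARY KEY, Name TEXT,
                      Department INTEGER, Salary INTEGER, Location TEXT)""")
cursor.executemany('INSERT INTO Employee VALUES (?, ?, ?, ?, ?)', employees)

# DISTINCT collapses the duplicate (Department, Location) pairs.
cursor.execute("""SELECT DISTINCT Department, Location
                  FROM Employee
                  ORDER BY Department""")
department_locations = cursor.fetchall()  # list of tuples, one per row
print(department_locations)
connection.close()
```

A module for another DB-API-compliant database would use the same sequence of calls, though its connect arguments and placeholder style may differ.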


16.2.1 A books Database 


In this section, we'll present a books database containing information about several of our
books. We'll set up the database in SQLite via the Python Standard Library’s sqlite3
module, using a script provided in the ch16 examples folder’s sql subfolder. Then, we'll
introduce the database’s tables. We'll use this database in an IPython session to introduce
various database concepts, including operations that create, read, update and delete data,
the so-called CRUD operations. As we introduce the tables, we’ll use SQL and pandas
DataFrames to show you each table’s contents. Then, in the next several sections, we'll
discuss additional SQL features.


Creating the books Database 


In your Anaconda Command Prompt, Terminal or shell, change to the ch16 examples
folder’s sql subfolder. The following sqlite3 command creates a SQLite database named
books.db and executes the books.sql SQL script, which defines how to create the
database’s tables and populates them with data:

sqlite3 books.db < books.sql

The notation < indicates that books.sql is input into the sqlite3 command. When the
command completes, the database is ready for use. Begin a new IPython session.


Connecting to the Database in Python 


To work with the database in Python, first call sqlite3’s connect function to connect to
the database and obtain a Connection object:

In [1]: import sqlite3

In [2]: connection = sqlite3.connect('books.db')

authors Table

The database has three tables: authors, author_ISBN and titles. The authors table
stores all the authors and has three columns:


• id: The author’s unique ID number. This integer column is defined as
  autoincremented: for each row inserted in the table, SQLite increases the id value by
  1 to ensure that each row has a unique value. This column is the table’s primary key.
• first: The author’s first name (a string).
• last: The author’s last name (a string).


Viewing the authors Table’s Contents 


Let’s use a SQL query and pandas to view the authors table’s contents: 


In [3]: import pandas as pd

In [4]: pd.options.display.max_columns = 10

In [5]: pd.read_sql('SELECT * FROM authors', connection,
                    index_col=['id'])
Out[5]:
        first    last
id
1        Paul  Deitel
2      Harvey  Deitel
3       Abbey  Deitel
4         Dan   Quirk
5   Alexander    Wald

Pandas function read_sql executes a SQL query and returns a DataFrame containing the
query’s results. The function’s arguments are:

• a string representing the SQL query to execute,
• the SQLite database’s Connection object, and in this case
• an index_col keyword argument indicating which column should be used as the
  DataFrame’s row indices (the author’s id values in this case).

As you'll see momentarily, when index_col is not passed, index values starting from 0
appear to the left of the DataFrame’s rows.
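
For comparison, the same kind of query can be run through the DB-API directly, without pandas. This sketch uses a two-row stand-in for the authors table in an in-memory database (an assumption, so the example does not depend on books.db), and the rows_by_id dictionary is only a loose analogy to indexing a DataFrame by id:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY AUTOINCREMENT,
                          first TEXT, last TEXT);
    INSERT INTO authors (first, last) VALUES ('Paul', 'Deitel');
    INSERT INTO authors (first, last) VALUES ('Harvey', 'Deitel');
""")

cursor = connection.execute('SELECT * FROM authors')
# cursor.description holds one tuple per result column; item 0 is the name.
column_names = [column[0] for column in cursor.description]
rows = cursor.fetchall()

# Key each row by its id, much as index_col=['id'] labels DataFrame rows.
rows_by_id = {row[0]: dict(zip(column_names[1:], row[1:])) for row in rows}
print(column_names)
print(rows_by_id)
```

Notice that the id values were never supplied in the INSERT statements; SQLite assigned 1 and 2 automatically.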


A SQL SELECT query gets rows and columns from one or more tables in a database. In the
query:

SELECT * FROM authors

the asterisk (*) is a wildcard indicating that the query should get all the columns from the
authors table. We’ll discuss SELECT queries in more detail shortly.





titles Table 


The titles table stores all the books and has four columns:

• isbn: The book’s ISBN (a string) is this table’s primary key. ISBN is an abbreviation for
  “International Standard Book Number,” which is a numbering scheme that publishers use
  to give every book a unique identification number.
• title: The book’s title (a string).
• edition: The book’s edition number (an integer).
• copyright: The book’s copyright year (a string).


Let’s use SQL and pandas to view the titles table’s contents: 


In [6]: pd.read_sql('SELECT * FROM titles', connection)
Out[6]:
         isbn                             title  edition copyright
0  0135404673     Intro to Python for CS and DS        1      2020
1  0132151006     Internet & WWW How to Program        5      2012
2  0134743350               Java How to Program       11      2018
3  0133976890                  C How to Program        8      2016
4  0133406954  Visual Basic 2012 How to Program        6      2014
5  0134601548          Visual C# How to Program        6      2017
6  0136151574         Visual C++ How to Program        2      2008
7  0134448235                C++ How to Program       10      2017
8  0134444302            Android How to Program        3      2017
9  0134289366         Android 6 for Programmers        3      2016





author_ISBN Table

The author_ISBN table uses the following columns to associate authors from the authors
table with their books in the titles table:

• id: An author’s id (an integer).
• isbn: The book’s ISBN (a string).

The id column is a foreign key, which is a column in this table that matches a primary-key
column in another table, in particular, the authors table’s id column. The isbn column
also is a foreign key; it matches the titles table’s isbn primary-key column. A database
might have many tables. A goal when designing a database is to minimize data duplication
among the tables. To do this, each table represents a specific entity, and foreign keys help link
the data in multiple tables. The primary keys and foreign keys are designated when you
create the database tables (in our case, in the books.sql script).

Together the id and isbn columns in this table form a composite primary key. Every row in
this table uniquely matches one author to one book’s ISBN. This table contains many entries,
so let’s use SQL and pandas to view just the first five rows:


In [7]: df = pd.read_sql('SELECT * FROM author_ISBN', connection)

In [8]: df.head()
Out[8]:
   id        isbn
0   1  0134289366
1   2  0134289366
2   5  0134289366
3   1  0135404673
4   2  0135404673


Every foreign-key value must appear as the primary-key value in a row of another table so the
DBMS can ensure that the foreign-key value is valid. This is known as the Rule of
Referential Integrity. For example, the DBMS ensures that the id value for a particular
author_ISBN row is valid by checking that there is a row in the authors table with that id
as the primary key.
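
Here is a small sketch of that check in action, using a cut-down, in-memory version of two of the tables (the schema below is our simplification, not the contents of books.sql, and the author id 99 is hypothetical). One SQLite-specific detail: foreign-key enforcement is off by default and must be enabled per connection with PRAGMA foreign_keys:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.execute('PRAGMA foreign_keys = ON')  # off by default in SQLite
connection.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, first TEXT, last TEXT);
    CREATE TABLE author_ISBN (
        id INTEGER, isbn TEXT,
        FOREIGN KEY (id) REFERENCES authors (id));
    INSERT INTO authors VALUES (1, 'Paul', 'Deitel');
    INSERT INTO author_ISBN VALUES (1, '0134289366');
""")

try:
    # There is no author with id 99, so this row violates the rule.
    connection.execute("INSERT INTO author_ISBN VALUES (99, '0134289366')")
    orphan_rejected = False
except sqlite3.IntegrityError as error:
    orphan_rejected = True
    print('Rejected:', error)
```

The first author_ISBN row is accepted because author 1 exists; the second is refused because it would refer to a nonexistent author.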


Foreign keys also allow related data in multiple tables to be selected from those tables and
combined; this is known as joining the data. There is a one-to-many relationship
between a primary key and a corresponding foreign key: one author can write many books,
and similarly one book can be written by many authors. So a foreign key can appear many
times in its table but only once (as the primary key) in another table. For example, in the
books database, the ISBN 0134289366 appears in several author_ISBN rows because this
book has several authors, but it appears only once as a primary key in titles.


Entity-Relationship (ER) Diagram 


The following entity-relationship (ER) diagram for the books database shows the
database’s tables and the relationships among them:

authors                author_ISBN                titles
-------                -----------                ------
id      1 --------- ∞  id
first                  isbn   ∞ --------- 1       isbn
last                                              title
                                                  edition
                                                  copyright

The first compartment in each box contains the table’s name, and the remaining
compartments contain the table’s columns. The names in italic are primary keys (id in
authors, isbn in titles, and both id and isbn in author_ISBN). A table’s
primary key uniquely identifies each row in the table. Every row must have a primary-key
value, and that value must be unique in the table. This is known as the Rule of Entity
Integrity. Again, for the author_ISBN table, the primary key is the combination of both
columns; this is known as a composite primary key.
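
With a composite primary key, it is the combination of values that must be unique, not each column on its own. The sketch below assumes a stand-alone, in-memory author_ISBN table declared with PRIMARY KEY (id, isbn); the actual definition in books.sql may differ in its details:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.execute("""CREATE TABLE author_ISBN (
                          id INTEGER NOT NULL,
                          isbn TEXT NOT NULL,
                          PRIMARY KEY (id, isbn))""")
connection.execute("INSERT INTO author_ISBN VALUES (1, '0134289366')")
# Same isbn, different author: a new (id, isbn) combination, so allowed.
connection.execute("INSERT INTO author_ISBN VALUES (2, '0134289366')")

try:
    # Repeats the (1, '0134289366') pair, so it should be rejected.
    connection.execute("INSERT INTO author_ISBN VALUES (1, '0134289366')")
    pair_rejected = False
except sqlite3.IntegrityError:
    pair_rejected = True

pair_count = connection.execute(
    'SELECT COUNT(*) FROM author_ISBN').fetchone()[0]
print(pair_rejected, pair_count)
```

The repeated ISBN is fine as long as it is paired with a different author id.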


The lines connecting the tables represent the relationships among the tables. Consider the
line between authors and author_ISBN. On the authors end there’s a 1, and on the
author_ISBN end there’s an infinity symbol (∞). This indicates a one-to-many relationship.
For each author in the authors table, there can be an arbitrary number of ISBNs for books
written by that author in the author_ISBN table; that is, an author can write any number
of books, so an author’s id can appear in multiple rows of the author_ISBN table. The
relationship line links the id column in the authors table (where id is the primary key) to
the id column in the author_ISBN table (where id is a foreign key). The line between the
tables links the primary key to the matching foreign key.

The line between the titles and author_ISBN tables illustrates a one-to-many
relationship: one book can be written by many authors. The line links the primary key isbn
in table titles to the corresponding foreign key in table author_ISBN. The relationships
in the entity-relationship diagram illustrate that the sole purpose of the author_ISBN table
is to provide a many-to-many relationship between the authors and titles tables: an
author can write many books, and a book can have many authors.
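
To see how author_ISBN connects the other two tables, the following sketch builds a reduced, in-memory version of the books database (only a few rows taken from the outputs shown earlier, and only the columns needed) and uses INNER JOIN to pair each title with its authors. The query is one possible formulation, shown for illustration:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, first TEXT, last TEXT);
    CREATE TABLE titles (isbn TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE author_ISBN (id INTEGER, isbn TEXT,
                              PRIMARY KEY (id, isbn));
    INSERT INTO authors VALUES (1, 'Paul', 'Deitel'), (2, 'Harvey', 'Deitel'),
                               (5, 'Alexander', 'Wald');
    INSERT INTO titles VALUES ('0134289366', 'Android 6 for Programmers'),
                              ('0135404673', 'Intro to Python for CS and DS');
    INSERT INTO author_ISBN VALUES (1, '0134289366'), (2, '0134289366'),
                                   (5, '0134289366'), (1, '0135404673'),
                                   (2, '0135404673');
""")

# Walk from authors through author_ISBN to titles via the two foreign keys.
query = """SELECT titles.title, authors.last, authors.first
           FROM authors
           INNER JOIN author_ISBN ON authors.id = author_ISBN.id
           INNER JOIN titles ON author_ISBN.isbn = titles.isbn
           ORDER BY titles.title, authors.first"""
pairs = connection.execute(query).fetchall()
for title, last, first in pairs:
    print(f'{title}: {first} {last}')
```

Each author_ISBN row contributes one line of output, so a book with three authors appears three times and an author with two books appears twice.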


SQL Keywords 


The following subsections continue our SQL presentation in the context of our books 
database, demonstrating SQL queries and statements using the SQL keywords in the 
following table. Other SQL keywords are beyond this text’s scope: 


SQL keyword   Description

SELECT        Retrieves data from one or more tables.
FROM          Tables involved in the query. Required in every SELECT.
WHERE         Criteria for selection that determine the rows to be retrieved, deleted or
              updated. Optional in a SQL statement.
GROUP BY      Criteria for grouping rows. Optional in a SELECT query.
ORDER BY      Criteria for ordering rows. Optional in a SELECT query.
INNER JOIN    Merge rows from multiple tables.
INSERT        Insert rows into a specified table.
UPDATE        Update rows in a specified table.
DELETE        Delete rows from a specified table.





16.2.2 SELECT Queries 


The previous section used SELECT statements and the * wildcard character to get all the
columns from a table. Typically, you need only a subset of the columns, especially in big data
where you could have dozens, hundreds, thousands or more columns. To retrieve only
specific columns, specify a comma-separated list of column names. For example, let’s retrieve
only the columns first and last from the authors table:

In [9]: pd.read_sql('SELECT first, last FROM authors', connection)
Out[9]:
       first    last
0       Paul  Deitel
1     Harvey  Deitel
2      Abbey  Deitel
3        Dan   Quirk
4  Alexander    Wald


16.2.3 WHERE Clause 


You'll often select rows in a database that satisfy certain selection criteria, especially in big
data where a database might contain millions or billions of rows. Only rows that satisfy the
selection criteria (formally called predicates) are selected. SQL’s WHERE clause specifies a
query’s selection criteria. Let’s select the title, edition and copyright for all books with
copyright years greater than 2016. String values in SQL queries are delimited by single (')
quotes, as in '2016':

In [10]: pd.read_sql("""SELECT title, edition, copyright
                        FROM titles
                        WHERE copyright > '2016'""", connection)
Out[10]:
                           title  edition copyright
0  Intro to Python for CS and DS        1      2020
1            Java How to Program       11      2018
2       Visual C# How to Program        6      2017
3             C++ How to Program       10      2017
4         Android How to Program        3      2017
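
When the comparison value comes from a variable rather than being typed into the query, the DB-API lets you pass it separately through a placeholder instead of building the SQL string yourself. The sketch below uses sqlite3's ? placeholder with a three-row stand-in for the titles table; the titles_after helper is hypothetical:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.execute(
    'CREATE TABLE titles (title TEXT, edition INTEGER, copyright TEXT)')
connection.executemany('INSERT INTO titles VALUES (?, ?, ?)', [
    ('Java How to Program', 11, '2018'),
    ('C How to Program', 8, '2016'),
    ('Visual C# How to Program', 6, '2017'),
])


def titles_after(year):
    """Return (title, copyright) rows with a copyright later than year."""
    # copyright is stored as a string, so pass the year as a string too.
    cursor = connection.execute(
        'SELECT title, copyright FROM titles WHERE copyright > ? '
        'ORDER BY copyright', (str(year),))
    return cursor.fetchall()


recent = titles_after(2016)
print(recent)
```

Because the value is bound by the database module, you do not need to add the single quotes yourself. Other DB-API modules may use a different placeholder style.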


Pattern Matching: Zero or More Characters 








The WHERE clause can contain the operators <, >, <=, >=, =, <> (not equal) and LIKE.
Operator LIKE is used for pattern matching: searching for strings that match a given
pattern. A pattern that contains the percent (%) wildcard character searches for strings that
have zero or more characters at the percent character’s position in the pattern. For example,
let’s locate all authors whose last name starts with the letter D:

In [11]: pd.read_sql("""SELECT id, first, last
                        FROM authors
                        WHERE last LIKE 'D%'""",
                     connection, index_col=['id'])
Out[11]:
     first    last
id
1     Paul  Deitel
2   Harvey  Deitel
3    Abbey  Deitel


Pattern Matching: Any Character 


An underscore (_) in the pattern string indicates a single wildcard character at that
position. Let’s select the rows of all the authors whose first names start with any character,
followed by the letter b, followed by any number of additional characters (specified by %):

In [12]: pd.read_sql("""SELECT id, first, last
                        FROM authors
                        WHERE first LIKE '_b%'""",
                     connection, index_col=['id'])
Out[12]:
    first    last
id
3   Abbey  Deitel
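
The two wildcards can be tried side by side without pandas. This sketch recreates the five authors in an in-memory database (an assumption, so it runs without books.db) and passes each pattern as a query parameter:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE authors (id INTEGER PRIMARY KEY, '
                   'first TEXT, last TEXT)')
connection.executemany('INSERT INTO authors (first, last) VALUES (?, ?)', [
    ('Paul', 'Deitel'), ('Harvey', 'Deitel'), ('Abbey', 'Deitel'),
    ('Dan', 'Quirk'), ('Alexander', 'Wald')])

# % matches zero or more characters: last names beginning with D.
starts_with_d = connection.execute(
    'SELECT first FROM authors WHERE last LIKE ? ORDER BY id',
    ('D%',)).fetchall()

# _ matches exactly one character: first names whose second letter is b.
second_letter_b = connection.execute(
    'SELECT first FROM authors WHERE first LIKE ? ORDER BY id',
    ('_b%',)).fetchall()

print(starts_with_d)
print(second_letter_b)
```

One SQLite detail worth knowing: by default its LIKE operator ignores case for ASCII letters, so the pattern 'd%' would select the same rows as 'D%'.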


16.2.4 ORDER BY Clause 


The ORDER BY clause sorts a query’s results into ascending order (lowest to highest) or
descending order (highest to lowest), specified with ASC and DESC, respectively. The default
sorting order is ascending, so ASC is optional. Let’s sort the titles in ascending order:

In [13]: pd.read_sql('SELECT title FROM titles ORDER BY title ASC',
                     connection)
Out[13]:
                              title
0         Android 6 for Programmers
1            Android How to Program
2                  C How to Program
3                C++ How to Program
4     Internet & WWW How to Program
5     Intro to Python for CS and DS
6               Java How to Program
7  Visual Basic 2012 How to Program
8          Visual C# How to Program
9         Visual C++ How to Program


Sorting By Multiple Columns 


To sort by multiple columns, specify a comma-separated list of column names after the
ORDER BY keywords. Let’s sort the authors’ names by last name, then by first name for any
authors who have the same last name:


In [14]: pd.read_sql("""SELECT id, first, last
    ...:                FROM authors
    ...:                ORDER BY last, first""",
    ...:             connection, index_col=['id'])
Out[14]:
        first    last
id
3       Abbey  Deitel
2      Harvey  Deitel
1        Paul  Deitel
4         Dan   Quirk
5   Alexander    Wald


The sorting order can vary by column. Let’s sort the authors in descending order by last name 
and ascending order by first name for any authors who have the same last name: 


In [15]: pd.read_sql("""SELECT id, first, last
    ...:                FROM authors
    ...:                ORDER BY last DESC, first ASC""",
    ...:             connection, index_col=['id'])
Out[15]:
        first    last
id
5   Alexander    Wald
4         Dan   Quirk
3       Abbey  Deitel
2      Harvey  Deitel
1        Paul  Deitel
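
Mixed sort directions can be checked with plain sqlite3 as well. This sketch assumes a throwaway in-memory authors table with a few of the names above; it is an illustration and does not touch the books database:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE authors (first TEXT, last TEXT)')
connection.executemany(
    'INSERT INTO authors VALUES (?, ?)',
    [('Paul', 'Deitel'), ('Dan', 'Quirk'),
     ('Abbey', 'Deitel'), ('Alexander', 'Wald')])

# last names descending; ties on last name broken by first name ascending
rows = connection.execute(
    'SELECT first, last FROM authors ORDER BY last DESC, first ASC').fetchall()
print(rows)
connection.close()
```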


Combining the WHERE and ORDER BY Clauses 














The WHERE and ORDER BY clauses can be combined in one query. Let’s get the isbn, title,
edition and copyright of each book in the titles table that has a title ending with
'How to Program' and sort them in ascending order by title.


In [16]: pd.read_sql("""SELECT isbn, title, edition, copyright
    ...:                FROM titles
    ...:                WHERE title LIKE '%How to Program'
    ...:                ORDER BY title""", connection)
Out[16]:
         isbn                             title  edition copyright
0  0134444302            Android How to Program        3      2017
1  0133976890                  C How to Program        8      2016
2  0134448235                C++ How to Program       10      2017
3  0132151006     Internet & WWW How to Program        5      2012
4  0134743350               Java How to Program       11      2018
5  0133406954  Visual Basic 2012 How to Program        6      2014
6  0134601548          Visual C# How to Program        6      2017
7  0136151574         Visual C++ How to Program        2      2008








16.2.5 Merging Data from Multiple Tables: INNER JOIN 


Recall that the books database’s author_ISBN table links authors to their corresponding
titles. If we did not separate this information into individual tables, we’d need to include
author information with each entry in the titles table. This would result in storing
duplicate author information for authors who wrote multiple books.


You can merge data from multiple tables, referred to as joining the tables, with INNER JOIN. 
Let’s produce a list of authors accompanied by the ISBNs for books written by each author— 
because there are many results for this query, we show just the head of the result: 


In [17]: pd.read_sql("""SELECT first, last, isbn
    ...:                FROM authors
    ...:                INNER JOIN author_ISBN
    ...:                    ON authors.id = author_ISBN.id
    ...:                ORDER BY last, first""", connection).head()
Out[17]:
    first    last        isbn
0   Abbey  Deitel  0132151006
1   Abbey  Deitel  0133406954
2  Harvey  Deitel  0134289366
3  Harvey  Deitel  0135404673
4  Harvey  Deitel  0132151006





The INNER JOIN’s ON clause uses a primary-key column in one table and a foreign-key
column in the other to determine which rows to merge from each table. This query merges
the authors table’s first and last columns with the author_ISBN table’s isbn column
and sorts the results in ascending order by last then first.

Note the syntax authors.id (table_name.column_name) in the ON clause. This qualified
name syntax is required if the columns have the same name in both tables. This syntax can
be used in any SQL statement to distinguish columns in different tables that have the same
name. In some systems, table names qualified with the database name can be used to
perform cross-database queries. As always, the query can contain an ORDER BY clause.
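
To see the primary-key/foreign-key matching in isolation, here is a small sketch using two in-memory tables. The table layout mirrors the one described above, but the rows and the ISBN strings are made-up placeholders:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, first TEXT, last TEXT);
    CREATE TABLE author_ISBN (id INTEGER, isbn TEXT,
                              FOREIGN KEY (id) REFERENCES authors (id));
    INSERT INTO authors VALUES (1, 'Paul', 'Deitel'), (2, 'Dan', 'Quirk');
    INSERT INTO author_ISBN VALUES (1, '1111111111'), (1, '2222222222'),
                                   (2, '2222222222');
""")

# Both tables have an id column, so every column is qualified with its table name.
joined = connection.execute("""
    SELECT authors.first, authors.last, author_ISBN.isbn
    FROM authors
    INNER JOIN author_ISBN
        ON authors.id = author_ISBN.id
    ORDER BY authors.last, authors.first, author_ISBN.isbn""").fetchall()
print(joined)
connection.close()
```

Each author row appears once per matching author_ISBN row, which is why an author of two books shows up twice.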


16.2.6 INSERT INTO Statement 


To this point, you’ve queried existing data. Sometimes you’ll execute SQL statements that
modify the database. To do so, you’ll use a sqlite3 Cursor object, which you obtain by
calling the Connection’s cursor method:

In [18]: cursor = connection.cursor()

The pandas method read_sql actually uses a Cursor behind the scenes to execute queries
and access the rows of the results.


The INSERT INTO statement inserts a row into a table. Let’s insert a new author named Sue
Red into the authors table by calling Cursor method execute, which executes its SQL
argument and returns the Cursor:

In [19]: cursor = cursor.execute("""INSERT INTO authors (first, last)
    ...:                            VALUES ('Sue', 'Red')""")


The SQL keywords INSERT INTO are followed by the table in which to insert the new row
and a comma-separated list of column names in parentheses. The list of column names is
followed by the SQL keyword VALUES and a comma-separated list of values in parentheses.
The values provided must match the column names specified both in order and type.

We do not specify a value for the id column because it’s an autoincremented column in the
authors table; this was specified in the script books.sql that created the table. For every
new row, SQLite assigns a unique id value that is the next value in the autoincremented
sequence (i.e., 1, 2, 3 and so on). In this case, Sue Red is assigned id number 6. To confirm
this, let’s query the authors table’s contents:


In [20]: pd.read_sql('SELECT id, first, last FROM authors',
    ...:             connection, index_col=['id'])
Out[20]:
        first    last
id
1        Paul  Deitel
2      Harvey  Deitel
3       Abbey  Deitel
4         Dan   Quirk
5   Alexander    Wald
6         Sue     Red


Note Regarding Strings That Contain Single Quotes 


SQL delimits strings with single quotes ('). A string containing a single quote, such as
O’Malley, must have two single quotes in the position where the single quote appears (e.g.,
'O''Malley'). The first acts as an escape character for the second. Not escaping single-quote
characters in a string that’s part of a SQL statement is a SQL syntax error.
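
As an alternative to doubling quotes by hand, the sqlite3 module accepts ? placeholders and substitutes the values itself. The sketch below shows both approaches side by side; the in-memory table and the two names are illustrative assumptions:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.execute("""CREATE TABLE authors (
    id INTEGER PRIMARY KEY AUTOINCREMENT, first TEXT, last TEXT)""")
cursor = connection.cursor()

# Manual escaping: two single quotes inside the SQL string literal
cursor.execute("INSERT INTO authors (first, last) VALUES ('Sean', 'O''Malley')")

# Placeholder substitution: the value is passed separately, no escaping needed
cursor.execute('INSERT INTO authors (first, last) VALUES (?, ?)',
               ('Erin', "O'Brien"))

last_names = [row[0] for row in
              cursor.execute('SELECT last FROM authors ORDER BY id')]
print(last_names)
connection.close()
```

Placeholders are generally the safer choice when values come from user input, because the value is never spliced into the SQL text.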


16.2.7 UPDATE Statement 


An UPDATE statement modifies existing values. Let’s assume that Sue Red’s last name is
incorrect in the database and update it to 'Black':

In [21]: cursor = cursor.execute("""UPDATE authors SET last='Black'
    ...:                            WHERE last='Red' AND first='Sue'""")





The UPDATE keyword is followed by the table to update, the keyword SET and a comma-separated
list of column_name = value pairs indicating the columns to change and their new
values. The change will be applied to every row if you do not specify a WHERE clause. The
WHERE clause in this query indicates that we should update only rows in which the last name
is 'Red' and the first name is 'Sue'.

Of course, there could be multiple people with the same first and last name. To make a
change to only one row, it’s best to use the row’s unique primary key in the WHERE clause. In
this case, we could have specified:


WHERE id = 6 


For statements that modify the database, the Cursor object’s rowcount attribute contains
an integer value representing the number of rows that were modified. If this value is 0, no
changes were made. The following confirms that the UPDATE modified one row:

In [22]: cursor.rowcount
Out[22]: 1


We also can confirm the update by listing the authors table’s contents: 


In [23]: pd.read_sql('SELECT id, first, last FROM authors',
    ...:             connection, index_col=['id'])
Out[23]:
        first    last
id
1        Paul  Deitel
2      Harvey  Deitel
3       Abbey  Deitel
4         Dan   Quirk
5   Alexander    Wald
6         Sue   Black


16.2.8 DELETE FROM Statement 


A SQL DELETE FROM statement removes rows from a table. Let’s remove Sue Black from the
authors table using her author ID:

In [24]: cursor = cursor.execute('DELETE FROM authors WHERE id=6')

In [25]: cursor.rowcount
Out[25]: 1














The optional WHERE clause determines which rows to delete. If WHERE is omitted, all the
table’s rows are deleted. Here’s the authors table after the DELETE operation:

In [26]: pd.read_sql('SELECT id, first, last FROM authors',
    ...:             connection, index_col=['id'])
Out[26]:
        first    last
id
1        Paul  Deitel
2      Harvey  Deitel
3       Abbey  Deitel
4         Dan   Quirk
5   Alexander    Wald


Closing the Database 


When you no longer need access to the database, you should call the Connection’s close
method to disconnect from the database:


connection.close() 
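
The insert, update, delete and close steps from the preceding subsections can be collected into one short script. This is a sketch under stated assumptions: it uses an empty in-memory table rather than books.db, and it adds an explicit commit, which the sqlite3 module (with its default settings) requires before modifications are saved to a database file:

```python
import sqlite3
from contextlib import closing

# closing() guarantees connection.close() runs, even if a statement fails
with closing(sqlite3.connect(':memory:')) as connection:
    cursor = connection.cursor()
    cursor.execute("""CREATE TABLE authors (
        id INTEGER PRIMARY KEY AUTOINCREMENT, first TEXT, last TEXT)""")

    cursor.execute("INSERT INTO authors (first, last) VALUES ('Sue', 'Red')")
    new_id = cursor.lastrowid  # id that SQLite assigned to the new row

    cursor.execute("UPDATE authors SET last = 'Black' WHERE id = ?", (new_id,))
    updated = cursor.rowcount  # rows changed by the UPDATE

    cursor.execute('DELETE FROM authors WHERE id = ?', (new_id,))
    deleted = cursor.rowcount  # rows removed by the DELETE

    cursor.execute('DELETE FROM authors WHERE id = ?', (new_id,))
    deleted_again = cursor.rowcount  # nothing left to delete

    connection.commit()  # make the modifications permanent

print(new_id, updated, deleted, deleted_again)
```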


SQL in Big Data 


SQL’s importance is growing in big data. Later in this chapter, we’ll use Spark SQL to query
data in a Spark DataFrame for which the data may be distributed over many computers in a
Spark cluster. As you’ll see, Spark SQL looks much like the SQL presented in this section.


16.3 NOSQL AND NEWSQL BIG-DATA DATABASES: A BRIEF TOUR


For decades, relational database management systems have been the standard in data
processing. However, they require structured data that fits into neat rectangular tables. As
the size of the data and the number of tables and relationships increase, relational databases
become more difficult to manipulate efficiently. In today’s big-data world, NoSQL and
NewSQL databases have emerged to deal with the kinds of data storage and processing
demands that traditional relational databases cannot meet. Big data requires massive
databases, often spread across data centers worldwide in huge clusters of commodity
computers. According to statista.com, there are currently over 8 million data centers
worldwide.¹³

¹³ https://www.statista.com/statistics/500458/worldwide-datacenter-and-it-sites/.


NoSQL originally meant what its name implies. With the growing importance of SQL in big
data (such as SQL on Hadoop and Spark SQL), NoSQL now is said to stand for “Not Only
SQL.” NoSQL databases are meant for unstructured data, like photos, videos and the natural
language found in e-mails, text messages and social-media posts, and semi-structured data
like JSON and XML documents. Semi-structured data often wraps unstructured data with
additional information called metadata. For example, YouTube videos are unstructured
data, but YouTube also maintains metadata for each video, including who posted it, when it
was posted, a title, a description, tags that help people discover the videos, privacy settings
and more, all returned as JSON from the YouTube APIs. This metadata adds structure to the
unstructured video data, making it semi-structured.


The next several subsections overview the four NoSQL database categories: key-value,
document, columnar (also called column-based) and graph. In addition, we’ll overview
NewSQL databases, which blend features of relational and NoSQL databases. In Section 16.4,
we’ll present a case study in which we store and manipulate a large number of JSON tweet
objects in a NoSQL document database, then summarize the data in an interactive
visualization displayed on a Folium map of the United States.


16.3.1 NoSQL Key-Value Databases 


Like Python dictionaries, key-value databases¹⁴ store key-value pairs, but they’re
optimized for distributed systems and big-data processing. For reliability, they tend to
replicate data in multiple cluster nodes. Some key-value databases, such as Redis, are
implemented in memory for performance, and others store data on disk, such as HBase,
which runs on top of Hadoop’s HDFS distributed file system. Other popular key-value
databases include Amazon DynamoDB, Google Cloud Datastore and Couchbase. DynamoDB
and Couchbase are multi-model databases that also support documents. HBase is also a
column-oriented database.

¹⁴ https://en.wikipedia.org/wiki/Key-value_database.


16.3.2 NoSQL Document Databases 


A document database¹⁵ stores semi-structured data, such as JSON or XML documents.
In document databases, you typically add indexes for specific attributes, so you can more
efficiently locate and manipulate documents. For example, let’s assume you’re storing JSON
documents produced by IoT devices and each document contains a type attribute. You might
add an index for this attribute so you can filter documents based on their types. Without
indexes, you can still perform that task, but it will be slower because you have to search each
document in its entirety to find the attribute.

¹⁵ https://en.wikipedia.org/wiki/Document-oriented_database.
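
The benefit of an index can be illustrated in plain Python, without a database. In this sketch the IoT documents, the scan function and the type_index dictionary are all hypothetical; a real document database builds and maintains its indexes internally:

```python
from collections import defaultdict

# Hypothetical IoT documents, each with a type attribute
documents = [
    {'_id': 1, 'type': 'thermostat', 'reading': 21.5},
    {'_id': 2, 'type': 'camera', 'reading': None},
    {'_id': 3, 'type': 'thermostat', 'reading': 19.0},
]

def scan(docs, value):
    """Unindexed search: every document must be inspected."""
    return [doc['_id'] for doc in docs if doc.get('type') == value]

# A simple index: attribute value -> ids of the documents that contain it
type_index = defaultdict(list)
for doc in documents:
    type_index[doc['type']].append(doc['_id'])

thermostats_by_scan = scan(documents, 'thermostat')  # touches all documents
thermostats_by_index = type_index['thermostat']      # one dictionary lookup
print(thermostats_by_scan, thermostats_by_index)
```

Both approaches return the same documents; the index trades extra storage and upkeep for a faster lookup.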

The most popular document database (and most popular overall NoSQL database¹⁶) is
MongoDB, whose name derives from a sequence of letters embedded in the word
“humongous.” In an example, we’ll store a large number of tweets in MongoDB for
processing. Recall that Twitter’s APIs return tweets in JSON format, so they can be stored
directly in MongoDB. After obtaining the tweets we’ll summarize them in a pandas DataFrame
and on a Folium map. Other popular document databases include Amazon
DynamoDB (also a key-value database), Microsoft Azure Cosmos DB and Apache CouchDB.

¹⁶ https://db-engines.com/en/ranking.


16.3.3 NoSQL Columnar Databases 


In a relational database, a common query operation is to get a specific column’s value for
every row. Because data is organized into rows, a query that selects a specific column can
perform poorly. The database system must get every matching row, locate the required
column and discard the rest of the row’s information. A columnar database¹⁷ ¹⁸, also
called a column-oriented database, is similar to a relational database, but it stores
structured data in columns rather than rows. Because all of a column’s elements are stored
together, selecting all the data for a given column is more efficient.

¹⁷ https://en.wikipedia.org/wiki/Columnar_database.

¹⁸ https://www.predictiveanalyticstoday.com/top-wide-columnar-store-databases/.

Consider our authors table in the books database:

        first    last
id
1        Paul  Deitel
2      Harvey  Deitel
3       Abbey  Deitel
4         Dan   Quirk
5   Alexander    Wald


In a relational database, all the data for a row is stored together. If we consider each row as a
Python tuple, the rows would be represented as (1, 'Paul', 'Deitel'), (2,
'Harvey', 'Deitel'), etc. In a columnar database, all the values for a given column
would be stored together, as in (1, 2, 3, 4, 5), ('Paul', 'Harvey', 'Abbey',
'Dan', 'Alexander') and ('Deitel', 'Deitel', 'Deitel', 'Quirk',
'Wald'). The elements in each column are maintained in row order, so the value at a given
index in each column belongs to the same row. Popular columnar databases include MariaDB
ColumnStore and HBase.
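
The row and column layouts described above can be converted into one another with Python's built-in zip function. This is only an illustration of the two layouts in memory, not of how any particular columnar database stores data on disk:

```python
# Row-oriented layout: one tuple per row
rows = [
    (1, 'Paul', 'Deitel'),
    (2, 'Harvey', 'Deitel'),
    (3, 'Abbey', 'Deitel'),
    (4, 'Dan', 'Quirk'),
    (5, 'Alexander', 'Wald'),
]

# Column-oriented layout: one tuple per column, elements kept in row order
ids, firsts, lasts = zip(*rows)

# Reading a whole column no longer requires visiting every row tuple
print(lasts)

# The same index in every column identifies one row
row_3 = (ids[2], firsts[2], lasts[2])
print(row_3)
```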


16.3.4 NoSQL Graph Databases 


A graph models relationships between objects.¹⁹ The objects are called nodes (or vertices)
and the relationships are called edges. Edges are directional. For example, an edge
representing an airline flight points from the origin city to the destination city, but not the
reverse. A graph database²⁰ stores nodes, edges and their attributes.

¹⁹ https://en.wikipedia.org/wiki/Graph_theory.

²⁰ https://en.wikipedia.org/wiki/Graph_database.


If you use social networks, like Instagram, Snapchat, Twitter and Facebook, consider your
social graph, which consists of the people you know (nodes) and the relationships between
them (edges). Every person has their own social graph, and these are interconnected. The
famous “six degrees of separation” problem says that any two people in the world are
connected to one another by following a maximum of six edges in the worldwide social
graph.²¹ Facebook’s algorithms use the social graphs of their billions of monthly active users²²
to determine which stories should appear in each user’s news feed. By looking at your
interests, your friends, their interests and more, Facebook predicts the stories they believe
are most relevant to you.²³

²¹ https://en.wikipedia.org/wiki/Six_degrees_of_separation.

²² https://zephoria.com/top-15-valuable-facebook-statistics/.

²³ https://newsroom.fb.com/news/2018/05/inside-feed-news-feed-ranking/.


Many companies use similar techniques to create recommendation engines. When you 
browse a product on Amazon, they use a graph of users and products to show you comparable 
products people browsed before making a purchase. When you browse movies on Netflix, 
they use a graph of users and movies they liked to suggest movies that might be of interest to 
you. 
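
The nodes-and-edges model and the “degrees of separation” idea discussed in this section can be sketched with a dictionary and a breadth-first search. The names and connections below are invented, and a graph database would store and traverse far larger graphs with its own query language:

```python
from collections import deque

# Hypothetical directed graph: each person (node) maps to the people
# that person's edges point to.
social_graph = {
    'Ana': ['Ben', 'Cy'],
    'Ben': ['Dee'],
    'Cy': ['Dee'],
    'Dee': ['Eli'],
    'Eli': [],
}

def degrees_of_separation(graph, start, target):
    """Return the fewest edges from start to target, or None if unreachable."""
    visited = {start}
    queue = deque([(start, 0)])  # (node, edges followed so far)
    while queue:
        node, distance = queue.popleft()
        if node == target:
            return distance
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None

ana_to_eli = degrees_of_separation(social_graph, 'Ana', 'Eli')
eli_to_ana = degrees_of_separation(social_graph, 'Eli', 'Ana')
print(ana_to_eli, eli_to_ana)
```

Because the edges are directional, a path from Ana to Eli does not imply a path from Eli back to Ana.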


One of the most popular graph databases is Neo4j. Many real-world use-cases for graph
databases are provided at:

https://neo4j.com/graphgists/

With most of the use-cases, sample graph diagrams produced by Neo4j are shown. These
visualize the relationships between the graph nodes. Check out Neo4j’s free PDF book, Graph
Databases.²⁴

²⁴ https://neo4j.com/graph-databases-book-sx2


16.3.5 NewSQL Databases 


Key advantages of relational databases include their security and transaction support. In
particular, relational databases typically use ACID (Atomicity, Consistency, Isolation,
Durability)²⁵ transactions:

²⁵ https://en.wikipedia.org/wiki/ACID_(computer_science).

• Atomicity ensures that the database is modified only if all of a transaction’s steps are
successful. If you go to an ATM to withdraw $100, that money is not removed from your
account unless you have enough money to cover the withdrawal and there is enough
money in the ATM to satisfy your request.

• Consistency ensures that the database state is always valid. In the withdrawal example
above, your new account balance after the transaction will reflect precisely what you
withdrew from your account (and possibly ATM fees).

• Isolation ensures that concurrent transactions occur as if they were performed
sequentially. For example, if two people share a joint bank account and both attempt to
withdraw money at the same time from two separate ATMs, one transaction must wait
until the other completes.

• Durability ensures that changes to the database survive even hardware failures.
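
Atomicity can be demonstrated with sqlite3, which supports transactions. The sketch below is a simplified, hypothetical model of the ATM example: the accounts table, its CHECK constraint and the withdraw function are assumptions made for illustration. Using the connection as a context manager commits if every statement succeeds and rolls back if any statement raises an exception:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.execute("""CREATE TABLE accounts (
    name TEXT PRIMARY KEY,
    balance INTEGER CHECK (balance >= 0))""")  # balances may not go negative
connection.execute("INSERT INTO accounts VALUES ('checking', 50), ('atm', 500)")
connection.commit()

def withdraw(amount):
    """Deduct amount from the ATM's cash and the account: all or nothing."""
    try:
        with connection:  # commit on success, roll back on an exception
            connection.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'atm'",
                (amount,))
            connection.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'checking'",
                (amount,))
        return True
    except sqlite3.IntegrityError:  # the CHECK constraint was violated
        return False

too_much = withdraw(100)  # account holds only 50, so both updates are undone
affordable = withdraw(30)
balances = dict(connection.execute('SELECT name, balance FROM accounts'))
print(too_much, affordable, balances)
connection.close()
```

In the failed withdrawal, the ATM's balance was already reduced when the second update failed; the rollback restores it, so no partial transaction is ever visible.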


If you research benefits and disadvantages of NoSQL databases, you’ll see that NoSQL
databases generally do not provide ACID support. The types of applications that use NoSQL
databases typically do not require the guarantees that ACID-compliant databases provide.
Many NoSQL databases adhere to the BASE (Basic Availability, Soft-state,
Eventual consistency) model, which focuses more on the database’s availability. Whereas
ACID databases guarantee consistency when you write to the database, BASE databases
provide consistency at some later point in time.


NewSQL databases blend the benefits of both relational and NoSQL databases for big-data 
processing tasks. Some popular NewSQL databases include VoltDB, MemSQL, Apache Ignite 
and Google Spanner. 


16.4 CASE STUDY: A MONGODB JSON DOCUMENT DATABASE


MongoDB is a document database capable of storing and retrieving JSON documents.
Twitter’s APIs return tweets to you as JSON objects, which you can write directly into a
MongoDB database. In this section, you’ll:

• use Tweepy to stream tweets about the 100 U.S. senators and store them into a MongoDB
database,

• use pandas to summarize the top 10 senators by tweet activity and

• display an interactive Folium map of the United States with one popup marker per state
that shows the state name and both senators’ names, their political parties and tweet
counts.


You’ll use a free cloud-based MongoDB Atlas cluster, which requires no installation and
currently allows you to store up to 512MB of data. To store more, you can download the
MongoDB Community Server from:

https://www.mongodb.com/download-center/community

and run it locally or you can sign up for MongoDB’s paid Atlas service.


Installing the Python Libraries Required for Interacting with MongoDB 


You’ll use the pymongo library to interact with MongoDB databases from your Python code.
You’ll also need the dnspython library to connect to a MongoDB Atlas Cluster. To install
these libraries, use the following commands:

conda install -c conda-forge pymongo
conda install -c conda-forge dnspython


keys.py 


The ch16 examples folder’s TwitterMongoDB subfolder contains this example’s code and
keys.py file. Edit this file to include your Twitter credentials and your OpenMapQuest key
from the “Data Mining Twitter” chapter. After we discuss creating a MongoDB Atlas cluster,
you’ll also need to add your MongoDB connection string to this file.


16.4.1 Creating the MongoDB Atlas Cluster 


To sign up for a free account go to

https://mongodb.com

then enter your email address and click Get started free. On the next page, enter your
name and create a password, then read their terms of service. If you agree, click Get started
free on this page and you’ll be taken to the screen for setting up your cluster. Click Build
my first cluster to get started.

They walk you through the getting started steps with popup bubbles that describe and point
you to each task you need to complete. They provide default settings for their free Atlas
cluster (M0, as they refer to it), so just give your cluster a name in the Cluster Name
section, then click Create Cluster. At this point, they’ll take you to the Clusters page and
begin creating your new cluster, which takes several minutes.


Next, a Connect to Atlas popup tutorial will appear, showing a checklist of additional steps
required to get you up and running:

• Create your first database user: This enables you to log into your cluster.

• Whitelist your IP address: This is a security measure which ensures that only IP
addresses you verify are allowed to interact with your cluster. To connect to this cluster
from multiple locations (school, home, work, etc.), you’ll need to whitelist each IP address
from which you intend to connect.

• Connect to your cluster: In this step, you’ll locate your cluster’s connection string,
which will enable your Python code to connect to the server.


Creating Your First Database User 


In the popup tutorial window, click Create your first database user to continue the
tutorial, then follow the on-screen prompts to view the cluster’s Security tab and click + ADD
NEW USER. In the Add New User dialog, create a username and password. Write
these down; you’ll need them momentarily. Click Add User to return to the Connect to
Atlas popup tutorial.


Whitelist Your IP Address 


In the popup tutorial window, click Whitelist your IP address to continue the tutorial,
then follow the on-screen prompts to view the cluster’s IP Whitelist and click + ADD IP
ADDRESS. In the Add Whitelist Entry dialog, you can either add your computer’s
current IP address or allow access from anywhere, which they do not recommend for
production databases, but is OK for learning purposes. Click ALLOW ACCESS FROM
ANYWHERE then click Confirm to return to the Connect to Atlas popup tutorial.


Connect to Your Cluster 


In the popup tutorial window, click Connect to your cluster to continue the tutorial, then
follow the on-screen prompts to view the cluster’s Connect to YourClusterName dialog.
Connecting to a MongoDB Atlas database from Python requires a connection string. To get
your connection string, click Connect Your Application, then click Short SRV
connection string. Your connection string will appear below Copy the SRV address.
Click COPY to copy the string. Paste this string into the keys.py file as
mongo_connection_string’s value. Replace "<PASSWORD>" in the connection string
with your password, and replace the database name "test" with "senators", which will
be the database name in this example. At the bottom of the Connect to
YourClusterName dialog, click Close. You’re now ready to interact with your Atlas cluster.
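
The two replacements described above can also be scripted. The following sketch is illustrative only: the template string, user name and host are placeholders rather than real Atlas values, and fill_connection_string is a hypothetical helper. It percent-escapes the password with urllib.parse.quote_plus, which matters when the password contains characters such as @ or / that have special meaning in a URI:

```python
from urllib.parse import quote_plus

def fill_connection_string(template, password, database='senators'):
    """Insert an escaped password and swap the default database name."""
    escaped = quote_plus(password)  # e.g. '@' becomes '%40'
    return (template.replace('<PASSWORD>', escaped)
                    .replace('/test?', f'/{database}?'))

# Placeholder template; copy your real one from the Atlas dialog
template = ('mongodb+srv://exampleUser:<PASSWORD>'
            '@cluster0.example.mongodb.net/test?retryWrites=true')

mongo_connection_string = fill_connection_string(template, 'p@ss/word')
print(mongo_connection_string)
```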


16.4.2 Streaming Tweets into MongoDB 


First we’ll present an interactive IPython session that connects to the MongoDB database,
downloads current tweets via Twitter streaming and summarizes the top-10 senators by tweet
count. Next, we’ll present class TweetListener, which handles the incoming tweets and
stores their JSON in MongoDB. Finally, we’ll continue the IPython session by creating an
interactive Folium map that displays information from the tweets we stored.


Use Tweepy to Authenticate with Twitter 


First, let’s use Tweepy to authenticate with Twitter:

In [1]: import tweepy, keys

In [2]: auth = tweepy.OAuthHandler(
   ...:     keys.consumer_key, keys.consumer_secret)
   ...: auth.set_access_token(keys.access_token,
   ...:     keys.access_token_secret)

Next, configure the Tweepy API object to wait if our app reaches any Twitter rate limits.

In [3]: api = tweepy.API(auth, wait_on_rate_limit=True,
   ...:                  wait_on_rate_limit_notify=True)





Loading the Senators’ Data 


We’ll use the information in the file senators.csv (located in the ch16 examples folder’s
TwitterMongoDB subfolder) to track tweets to, from and about every U.S. senator. The file
contains each senator’s two-letter state code, name, party, Twitter handle and Twitter ID.

Twitter enables you to follow specific users via their numeric Twitter IDs, but these must be
submitted as string representations of those numeric values. So, let’s load senators.csv
into pandas, convert the TwitterID values to strings (using Series method astype) and
display several rows of data. In this case, we set 6 as the maximum number of columns to
display. Later we’ll add another column to the DataFrame and this setting will ensure that
all the columns are displayed, rather than a few with ... in between:


In [4]: import pandas as pd

In [5]: senators_df = pd.read_csv('senators.csv')

In [6]: senators_df['TwitterID'] = senators_df['TwitterID'].astype(str)

In [7]: pd.options.display.max_columns = 6

In [8]: senators_df.head()
Out[8]:
  State            Name Party   TwitterHandle           TwitterID
0    AL  Richard Shelby     R       SenShelby            21111098
1    AL      Doug Jones     D    SenDougJones  941080085121175552
2    AK  Lisa Murkowski     R   lisamurkowski            18061669
3    AK    Dan Sullivan     R  SenDanSullivan          2891210047
4    AZ         Jon Kyl     R       SenJonKyl            24905240


Configuring the MongoClient 


To store the tweet’s JSON as documents in a MongoDB database, you must first connect to
your MongoDB Atlas cluster via a pymongo MongoClient, which receives your cluster’s
connection string as its argument:

In [9]: from pymongo import MongoClient

In [10]: atlas_client = MongoClient(keys.mongo_connection_string)

Now, we can get a pymongo Database object representing the senators database. The
following statement creates the database if it does not exist:

In [11]: db = atlas_client.senators


Setting up Tweet Stream 


Let’s specify the number of tweets to download and create the TweetListener. We pass the
db object representing the MongoDB database to the TweetListener so it can write the
tweets into the database. Depending on the rate at which people are tweeting about the
senators, it may take minutes to hours to get 10,000 tweets. For testing purposes, you might
want to use a smaller number:

In [12]: from tweetlistener import TweetListener

In [13]: tweet_limit = 10000

In [14]: twitter_stream = tweepy.Stream(api.auth,
    ...:     TweetListener(api, db, tweet_limit))


Starting the Tweet Stream 


Twitter live streaming allows you to track up to 400 keywords and follow up to 5,000 Twitter
IDs at a time. In this case, let’s track the senators’ Twitter handles and follow the senators’
Twitter IDs. This should give us tweets from, to and about each senator. To show you
progress, we display the screen name and time stamp for each tweet received, and the total
number of tweets so far. To save space, we show here only one of those tweet outputs and
replace the user’s screen name with XXXXXXX:

In [15]: twitter_stream.filter(track=senators_df.TwitterHandle.tolist(),
    ...:     follow=senators_df.TwitterID.tolist())

    Screen name: XXXXXXX
     Created at: Sun Dec 16 17:19:19 +0000 2018
Tweets received: 1


Class TweetListener 


For this example, we slightly modified class TweetListener from the “Data Mining
Twitter” chapter. Much of the Twitter and Tweepy code shown below is identical to the code
you saw previously, so we’ll focus on only the new concepts here:

 1  # tweetlistener.py
 2  """TweetListener downloads tweets and stores them in MongoDB."""
 3  import json
 4  import tweepy
 5
 6  class TweetListener(tweepy.StreamListener):
 7      """Handles incoming Tweet stream."""
 8
 9      def __init__(self, api, database, limit=10000):
10          """Create instance variables for tracking number of tweets."""
11          self.db = database
12          self.tweet_count = 0
13          self.TWEET_LIMIT = limit  # 10,000 by default
14          super().__init__(api)  # call superclass's init
15
16      def on_connect(self):
17          """Called when your connection attempt is successful, enabling
18          you to perform appropriate application tasks at that point."""
19          print('Successfully connected to Twitter\n')
20
21      def on_data(self, data):
22          """Called when Twitter pushes a new tweet to you."""
23          self.tweet_count += 1  # track number of tweets processed
24          json_data = json.loads(data)  # convert string to JSON
25          self.db.tweets.insert_one(json_data)  # store in tweets collection
26          print(f'    Screen name: {json_data["user"]["name"]}')
27          print(f'     Created at: {json_data["created_at"]}')
28          print(f'Tweets received: {self.tweet_count}')
29
30          # if TWEET_LIMIT is reached, return False to terminate streaming
31          return self.tweet_count != self.TWEET_LIMIT
32
33      def on_error(self, status):
34          print(status)
35          return True











Previously, TweetListener overrode method on_status to receive Tweepy Status
objects representing tweets. Here, we override the on_data method instead (lines 21-31).
Rather than Status objects, on_data receives each tweet object’s raw JSON. Line 24
converts the JSON string received by on_data into a Python dictionary representing the JSON
object. Each MongoDB database contains one or more Collections of documents. In line 25,
the expression

self.db.tweets

accesses the Database object db’s tweets Collection, creating it if it does not already
exist. Line 25 uses the tweets Collection’s insert_one method to store the JSON
object in the tweets collection.
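
The parse, store and count logic of on_data can be exercised without Twitter or MongoDB. In this sketch, FakeCollection is a hypothetical stand-in that only records what is passed to insert_one, handle_tweet is an illustrative function rather than part of tweetlistener.py, and the JSON string imitates just the two tweet fields that TweetListener prints:

```python
import json

class FakeCollection:
    """Stand-in for a pymongo Collection; records inserted documents."""
    def __init__(self):
        self.documents = []

    def insert_one(self, document):
        self.documents.append(document)

def handle_tweet(raw_json, collection, count, limit):
    """Mimic on_data: parse, store and report whether to keep streaming."""
    count += 1
    document = json.loads(raw_json)  # JSON text -> Python dictionary
    collection.insert_one(document)
    return count, count != limit     # False once the limit is reached

tweets = FakeCollection()
raw = json.dumps({'user': {'name': 'XXXXXXX'},
                  'created_at': 'Sun Dec 16 17:19:19 +0000 2018'})

count, keep_going = handle_tweet(raw, tweets, 0, limit=2)
count, keep_going_after_second = handle_tweet(raw, tweets, count, limit=2)
print(count, keep_going, keep_going_after_second, len(tweets.documents))
```

Returning False from the final call corresponds to on_data returning False, which tells Tweepy to terminate the stream.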


Counting Tweets for Each Senator 


Next, we’ll perform a full-text search on the collection of tweets and count the number of
tweets containing each senator’s Twitter handle. To text search in MongoDB, you must create
a text index for the collection.²⁶ This specifies which document field(s) to search. Each text
index is defined as a tuple containing the field name to search and the index type ('text').
MongoDB’s wildcard specifier ($**) indicates that every text field in a document (a JSON
tweet object in our case) should be indexed for a full-text search:

²⁶ For additional details on MongoDB index types, text indexes and operators, see:
https://docs.mongodb.com/manual/indexes,
https://docs.mongodb.com/manual/core/index-text and
https://docs.mongodb.com/manual/reference/operator.

In [16]: db.tweets.create_index([('$**', 'text')])
Out[16]: '$**_text'


Once the index is defined, we can use the Collection's count_documents method to
count the total number of documents in the collection that contain the specified text. Let's
search the database's tweets collection for every Twitter handle in the senators_df
DataFrame's TwitterHandle column:


In [17]: tweet_counts = []

In [18]: for senator in senators_df.TwitterHandle:
             tweet_counts.append(db.tweets.count_documents(
                 {"$text": {"$search": senator}}))


The JSON object passed to count_documents in this case indicates that we're using MongoDB's
$text query operator, which searches the collection's text index, to find the value of senator.


Show Tweet Counts for Each Senator 


Let's create a copy of the DataFrame senators_df that contains the tweet counts as a
new column, then display the top-10 senators by tweet count:

In [19]: tweet_counts_df = senators_df.assign(Tweets=tweet_counts)

In [20]: tweet_counts_df.sort_values(by='Tweets',
             ascending=False).head(10)





Out[20]:
   State                Name Party    TwitterHandle   TwitterID  Tweets
78    SC      Lindsey Graham     R  LindseyGrahamSC   432895323    1405
41    MA    Elizabeth Warren     D        SenWarren   970207298    1249
8     CA    Dianne Feinstein     D     SenFeinstein   476256944    LOTS
20    HI        Brian Schatz     D      brianschatz    47747074     934
62    NY       Chuck Schumer     D       SenSchumer    17494010     811
24    IL     Tammy Duckworth     D     SenDuckworth  1058520120     656
13    CT  Richard Blumenthal     D    SenBlumenthal   281240539     646
21    HI        Mazie Hirono     D      maziehirono    92186819     628
86    UT         Orrin Hatch     R    SenOrrinHatch   262756641     506
qa    RI  Sheldon Whitehouse     D    SenWhitehouse   242555999     350








Get the State Locations for Plotting Markers 


Next, we'll use the techniques you learned in the “Data Mining Twitter” chapter to get each
state's latitude and longitude coordinates. We'll soon use these to place popup markers on a
Folium map. Each marker will contain the names of a state's senators and the number of
tweets mentioning each one.


The file state_codes.py contains a state_codes dictionary that maps two-letter state
codes to their full state names. We'll use the full state names with geopy's OpenMapQuest
geocode function to look up the location of each state.⁷ First, let's import the libraries we
need and the state_codes dictionary:


⁷We use full state names because, during our testing, the two-letter state codes did not
always return correct locations.

In [21]: from geopy import OpenMapQuest

In [22]: import time

In [23]: from state_codes import state_codes


Next, let’s get the geocoder object to translate location names into Location objects: 


In [24]: geo = OpenMapQuest(api_key=keys.mapquest_key)


There are two senators from each state, so we can look up each state's location once and use
the Location object for both senators from that state. Let's get the unique state names, then
sort them into ascending order:


In [25]: states = tweet_counts_df.State.unique()

In [26]: states.sort()


The next two snippets use code from the “Data Mining Twitter” chapter to look up each
state's location. In snippet [28], we call the geocode function with the state name followed
by ', USA' to ensure that we get United States locations,⁸ since there are places outside the
United States with the same names as U.S. states. To show progress, we display each new
Location object's string:

⁸When we initially performed the geocoding for Washington state, OpenMapQuest returned
Washington, D.C.'s location. So we modified state_codes.py to use Washington State
instead.


In [27]: locations = []

In [28]: for state in states:
             processed = False
             delay = .1
             while not processed:
                 try:
                     locations.append(
                         geo.geocode(state_codes[state] + ', USA'))
                     print(locations[-1])
                     processed = True
                 except:  # timed out, so wait before trying again
                     print('OpenMapQuest service timed out. Waiting.')
                     time.sleep(delay)
                     delay += .1


Alaska, United States of America 
Alabama, United States of America 
Arkansas, United States of America 
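
The retry logic in snippet [28] can be isolated into a small helper. The following is a minimal sketch, not the code used above: retry_with_backoff and flaky_lookup are hypothetical names, the built-in TimeoutError stands in for whatever exception the geocoder raises on a timeout, and the sketch caps the number of attempts, whereas snippet [28] retries indefinitely and catches every exception.

```python
import time

def retry_with_backoff(operation, delay=0.1, step=0.1, max_attempts=5):
    """Call operation until it succeeds, waiting a little longer each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            time.sleep(delay)
            delay += step  # wait longer before the next attempt

# hypothetical stand-in for geo.geocode that times out twice, then succeeds
calls = {'count': 0}

def flaky_lookup():
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('service timed out')
    return 'Alaska, United States of America'

result = retry_with_backoff(flaky_lookup, delay=0.01, step=0.01)
print(result)
```

Catching only the timeout exception, rather than using a bare except, keeps unrelated errors (such as a misspelled dictionary key) from being retried silently.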


Grouping the Tweet Counts by State 


We'll use the total number of tweets for the two senators in a state to color that state on the
map. Darker colors will represent the states with higher tweet counts. To prepare the data for
mapping, let's use the pandas DataFrame method groupby to group the senators by state
and calculate the total tweets by state:


In [29]: tweets_counts_by_state = tweet_counts_df.groupby(
             'State', as_index=False).sum()

In [30]: tweets_counts_by_state.head()
Out[30]:
  State  Tweets
0    AK      27
1    AL       2
2    AR      47
3    AZ      47
4    CA    1125


The as_index=False keyword argument in snippet [29] indicates that the state codes
should be values in a column of the resulting DataFrame, rather than the indices for the
rows. The GroupBy object's sum method totals the numeric data (the tweets by state) and
returns the totals as a DataFrame. Snippet [30] displays several rows of that DataFrame so
you can see some of the results.
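
To make the grouping concrete, here is a plain-Python sketch of the same computation: totaling the tweets for each state code. It does not use pandas. The rows list is a stand-in for tweet_counts_df, and the Massachusetts value 3 is invented for illustration.

```python
from collections import defaultdict

# (state, tweets) rows standing in for tweet_counts_df; the value 3 is invented
rows = [('SC', 1405), ('MA', 1249), ('SC', 11), ('MA', 3)]

totals = defaultdict(int)
for state, tweets in rows:
    totals[state] += tweets  # accumulate each state's total

# sort by state code, as the rows of the grouped result are sorted
by_state = sorted(totals.items())
print(by_state)
```

Each (state, total) pair corresponds to one row of the grouped DataFrame, with the state code in one column and the total in the other.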


Creating the Map 


Next, let’s create the map. You may want to adjust the zoom. On our system, the following 
snippet creates a map in which we initially can see only the continental United States. 
Remember that Folium maps are interactive, so once the map is displayed, you can scroll to 
zoom in and out or drag to see different areas, such as Alaska or Hawaii: 


In [31]: import folium

In [32]: usmap = folium.Map(location=[39.8293, -98.5795],
                            zoom_start=4, detect_retina=True,
                            tiles='Stamen Toner')


Creating a Choropleth to Color the Map 


A choropleth shades areas in a map using the values you specify to determine color. Let's
create a choropleth that colors the states by the number of tweets containing their senators'
Twitter handles. First, save Folium's us-states.json file at

https://raw.githubusercontent.com/python-visualization/folium/master/examples/d

to the folder containing this example. This file contains a JSON dialect called GeoJSON
(Geographic JSON) that describes the boundaries of shapes, in this case the boundaries
of every U.S. state. The choropleth uses this information to shade each state. For more about
GeoJSON, see http://geojson.org/.⁹ The following snippets create the choropleth, then
add it to the map:


⁹Folium provides several other GeoJSON files in its examples folder at
https://github.com/python-visualization/folium/tree/master/examples/data. You also can
create your own at http://geojson.io.


In [33]: choropleth = folium.Choropleth(
             geo_data='us-states.json',
             name='choropleth',
             data=tweets_counts_by_state,
             columns=['State', 'Tweets'],
             key_on='feature.id',
             fill_color='YlOrRd',
             fill_opacity=0.7,
             line_opacity=0.2,
             legend_name='Tweets by State'
         ).add_to(usmap)


In [34]: layer = folium.LayerControl().add_to(usmap) 


In this case, we used the following arguments: 


• geo_data='us-states.json': This is the file containing the GeoJSON that specifies
the shapes to color.

• name='choropleth': Folium displays the Choropleth as a layer over the map. This
is the name for that layer that will appear in the map's layer controls, which enable you to
hide and show the layers. These controls appear when you click the layers icon on the
map.

• data=tweets_counts_by_state: This is a pandas DataFrame (or Series)
containing the values that determine the Choropleth colors.

• columns=['State', 'Tweets']: When the data is a DataFrame, this is a list of
two columns representing the keys and the corresponding values used to color the
Choropleth.

• key_on='feature.id': This is a variable in the GeoJSON file to which the
Choropleth binds the values in the columns argument.

• fill_color='YlOrRd': This is a color map specifying the colors to use to fill in the
states. Folium provides 12 colormaps: 'BuGn', 'BuPu', 'GnBu', 'OrRd', 'PuBu',
'PuBuGn', 'PuRd', 'RdPu', 'YlGn', 'YlGnBu', 'YlOrBr' and 'YlOrRd'. You
should experiment with these to find the most effective and eye-pleasing ones for your
application(s).

• fill_opacity=0.7: A value from 0.0 (transparent) to 1.0 (opaque) specifying the
transparency of the fill colors displayed in the states.

• line_opacity=0.2: A value from 0.0 (transparent) to 1.0 (opaque) specifying the
transparency of lines used to delineate the states.

• legend_name='Tweets by State': At the top of the map, the Choropleth displays
a color bar (the legend) indicating the value range represented by the colors. This
legend_name text appears below the color bar to indicate what the colors represent.


The complete list of Choropleth keyword arguments is documented at: 


http://python-visualization.github.io/folium/modules.html#folium.features.Choro


Creating the Map Markers for Each State


Next, we'll create Markers for each state. To ensure that the senators are displayed in
descending order by the number of tweets in each state's Marker, let's sort
tweet_counts_df in descending order by the 'Tweets' column:


In [35]: sorted_df = tweet_counts_df.sort_values(
             by='Tweets', ascending=False)


The loop in the following snippet creates the Markers. First,

sorted_df.groupby('State')

groups sorted_df by 'State'. A DataFrame's groupby method maintains the original
row order in each group. Within a given group, the senator with the most tweets will be first,
because we sorted the senators in descending order by tweet count in snippet [35]:


In [36]: for index, (name, group) in enumerate(sorted_df.groupby('State')):
             strings = [state_codes[name]]  # used to assemble popup text

             for s in group.itertuples():
                 strings.append(
                     f'{s.Name} ({s.Party}); Tweets: {s.Tweets}')

             text = '<br>'.join(strings)
             marker = folium.Marker(
                 (locations[index].latitude, locations[index].longitude),
                 popup=text)
             marker.add_to(usmap)





We pass the grouped DataFrame to enumerate, so we can get an index for each group,
which we'll use to look up each state's Location in the locations list. Each group has a
name (the state code we grouped by) and a collection of items in that group (the two senators
for that state). The loop operates as follows:

• We look up the full state name in the state_codes dictionary, then store it in the
strings list. We'll use this list to assemble the Marker's popup text.

• The nested loop walks through the items in the group collection, returning each as a
named tuple that contains a given senator's data. We create a formatted string for the
current senator containing the person's name, party and number of tweets, then append
that to the strings list.

• The Marker text can use HTML for formatting. We join the strings list's elements,
separating each from the next with an HTML <br> element, which creates a new line in
HTML.

• We create the Marker. The first argument is the Marker's location as a tuple containing
the latitude and longitude. The popup keyword argument specifies the text to display if
the user clicks the Marker.

• We add the Marker to the map.
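
The popup-text assembly described in the steps above can be tried on its own, without pandas or Folium. In this sketch the Senator named tuple is a hypothetical stand-in for the rows that itertuples returns; the field names are assumed to match the DataFrame's Name, Party and Tweets columns, and the values are taken from the South Carolina rows.

```python
from collections import namedtuple

# hypothetical stand-in for the named tuples that itertuples returns
Senator = namedtuple('Senator', ['Name', 'Party', 'Tweets'])

group = [Senator('Lindsey Graham', 'R', 1405), Senator('Tim Scott', 'R', 11)]

strings = ['South Carolina']  # the full state name comes first
for s in group:
    strings.append(f'{s.Name} ({s.Party}); Tweets: {s.Tweets}')

# <br> starts a new line when the popup renders the text as HTML
text = '<br>'.join(strings)
print(text)
```

Because the group is already in descending order by tweet count, the senator with the most tweets appears directly under the state name.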


Displaying the Map


Finally, let's save the map into an HTML file:

In [37]: usmap.save('SenatorsTweets.html')

Open the HTML file in your web browser to view and interact with the map. Recall that you
can drag the map to see Alaska and Hawaii. Here we show the popup text for the South
Carolina marker:








South Carolina
Lindsey Graham (R); Tweets: 1405
Tim Scott (R); Tweets: 11





You could enhance this case study to use the sentiment-analysis techniques you learned in
previous chapters to rate as positive, neutral or negative the sentiment expressed by people
who send tweets (“tweeters”) mentioning each senator's handle.


16.5 HADOOP 


The next several sections show how Apache Hadoop and Apache Spark deal with big-data
storage and processing challenges via huge clusters of computers, massively parallel
processing, Hadoop MapReduce programming and Spark in-memory processing techniques.
Here, we discuss Apache Hadoop, a key big-data infrastructure technology that also serves as
the foundation for many recent advancements in big-data processing and an entire ecosystem
of software tools that are continually evolving to support today's big-data needs.


16.5.1 Hadoop Overview 


When Google was launched in 1998, the amount of online data was already enormous, with
approximately 2.4 million websites.¹⁰ That was truly big data. Today there are nearly two
billion websites¹¹ (almost a thousandfold increase) and Google is handling over two trillion
searches per year!¹² Having used Google search since its inception, our sense is that today's
responses are significantly faster.

¹⁰http://www.internetlivestats.com/total-number-of-websites/.

¹¹http://www.internetlivestats.com/total-number-of-websites/.

¹²http://www.internetlivestats.com/google-search-statistics/.

When Google was developing their search engine, they knew that they needed to return 
search results quickly. The only practical way to do this was to store and index the entire 
Internet using a clever combination of secondary storage and main memory. Computers of 
that time couldn’t hold that amount of data and could not analyze that amount of data fast 
enough to guarantee prompt search-query responses. So Google developed a clustering 
system, tying together vast numbers of computers—called nodes. Because having more 
computers and more connections between them meant greater chance of hardware failures, 
they also built in high levels of redundancy to ensure that the system would continue 
functioning even if nodes within clusters failed. The data was distributed across all these 
inexpensive “commodity computers.” To satisfy a search request, all the computers in the 
cluster searched in parallel the portion of the web they had locally. Then the results of those 
searches were gathered up and reported back to the user. 


To accomplish this, Google needed to develop the clustering hardware and software,
including distributed storage. Google published its designs, but did not open source its
software. Programmers at Yahoo!, working from Google's designs in the “Google File System”
paper,¹³ then built their own system. They open-sourced their work and the Apache
organization implemented the system as Hadoop. The name came from an elephant stuffed
animal that belonged to a child of one of Hadoop's creators.

¹³http://static.googleusercontent.com/media/research.google.com/en//archive/gf
osp2003.pdf.


Two additional Google papers also contributed to the evolution of Hadoop: “MapReduce:
Simplified Data Processing on Large Clusters”¹⁴ and “Bigtable: A Distributed Storage System
for Structured Data,”¹⁵ which was the basis for Apache HBase (a NoSQL key-value and
column-based database).¹⁶

¹⁴http://static.googleusercontent.com/media/research.google.com/en//archive/ma
sdi04.pdf.

¹⁵http://static.googleusercontent.com/media/research.google.com/en//archive/bi
sdi06.pdf.

¹⁶Many other influential big-data-related papers (including the ones we mentioned) can be
found at: https://bigdata-madesimple.com/research-papers-that-changed-the-world-of-big-data/.


HDFS, MapReduce and YARN


Hadoop's key components are:

• HDFS (Hadoop Distributed File System) for storing massive amounts of data
throughout a cluster, and

• MapReduce for implementing the tasks that process the data.


Earlier in the book we introduced basic functional-style programming and filter/map/reduce.
Hadoop MapReduce is similar in concept, just on a massively parallel scale. A MapReduce
task performs two steps: mapping and reduction. The mapping step, which also may
include filtering, processes the original data across the entire cluster and maps it into tuples
of key-value pairs. The reduction step then combines those tuples to produce the results of
the MapReduce task. The key is how the MapReduce step is performed. Hadoop divides the
data into batches that it distributes across the nodes in the cluster, which can range from a
few nodes to a Yahoo! cluster with 40,000 nodes and over 100,000 cores.¹⁷ Hadoop also
distributes the MapReduce task's code to the nodes in the cluster and executes the code in
parallel on every node. Each node processes only the batch of data stored on that node. The
reduction step combines the results from all the nodes to produce the final result. To
coordinate this, Hadoop uses YARN (“yet another resource negotiator”) to manage all
the resources in the cluster and schedule tasks for execution.
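
The two steps can be sketched in ordinary, single-process Python. This is only a conceptual illustration: map_step and reduce_step are hypothetical names, the two batches are invented, and nothing here runs in parallel or involves Hadoop.

```python
from collections import defaultdict

def map_step(batch):
    """Map each word in a batch of lines to a (word length, 1) tuple."""
    return [(len(word), 1) for line in batch for word in line.split()]

def reduce_step(pairs):
    """Combine the tuples by key, summing the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# two invented batches, standing in for data stored on two different nodes
batches = [['to be or not to be'], ['that is the question']]

# each "node" maps only its own batch; the reduction combines all the results
mapped = [pair for batch in batches for pair in map_step(batch)]
result = reduce_step(mapped)
print(sorted(result.items()))
```

In a real cluster, each call to map_step would run on the node that stores that batch, and only the much smaller key-value tuples would move across the network to the reduction step.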





¹⁷https://wiki.apache.org/hadoop/PoweredBy


Hadoop Ecosystem 


Though Hadoop began with HDFS and MapReduce, followed closely by YARN, it has grown
into a large ecosystem that includes Spark (discussed in Sections 16.6-16.7) and many other
Apache projects:¹⁸,¹⁹,²⁰

¹⁸https://hortonworks.com/ecosystems/.

¹⁹https://readwrite.com/2018/06/26/complete-guide-of-hadoop-ecosystem-components/.

²⁰https://www.janbasktraining.com/blog/introduction-architecture-components-hadoop-ecosystem/.


• Ambari (https://ambari.apache.org): Tools for managing Hadoop clusters.

• Drill (https://drill.apache.org): SQL querying of non-relational data in Hadoop
and NoSQL databases.

• Flume (https://flume.apache.org): A service for collecting and storing (in HDFS
and other storage) streaming event data, like high-volume server logs, IoT messages and
more.

• HBase (https://hbase.apache.org): A NoSQL database for big data with “billions
of rows by²¹ millions of columns -- atop clusters of commodity hardware.”

²¹We used the word by to replace X in the original text.

• Hive (https://hive.apache.org): Uses SQL to interact with data in data
warehouses. A data warehouse aggregates data of various types from various sources.
Common operations include extracting data, transforming it and loading it (known as ETL)
into another database, typically so you can analyze it and create reports from it.

• Impala (https://impala.apache.org): A database for real-time SQL-based
queries across distributed data stored in Hadoop HDFS or HBase.

• Kafka (https://kafka.apache.org): Real-time messaging, stream processing and
storage, typically to transform and process high-volume streaming data, such as website
activity and streaming IoT data.

• Pig (https://pig.apache.org): A scripting platform that converts data analysis
tasks from a scripting language called Pig Latin into MapReduce tasks.

• Sqoop (https://sqoop.apache.org): Tool for moving structured, semi-structured
and unstructured data between databases.

• Storm (https://storm.apache.org): A real-time stream-processing system for
tasks such as data analytics, machine learning, ETL and more.

• ZooKeeper (https://zookeeper.apache.org): A service for managing cluster
configurations and coordination between clusters.

• And more.


Hadoop Providers 


Numerous cloud vendors provide Hadoop as a service, including Amazon EMR, Google Cloud
DataProc, IBM Watson Analytics Engine, Microsoft Azure HDInsight and others. In addition,
companies like Cloudera and Hortonworks (which at the time of this writing are merging)
offer integrated Hadoop-ecosystem components and tools via the major cloud vendors. They
also offer free downloadable environments that you can run on the desktop²² for learning,
development and testing before you commit to cloud-based hosting, which can incur
significant costs. We introduce MapReduce programming in the example in the following
sections by using a Microsoft cloud-based Azure HDInsight cluster, which provides Hadoop
as a service.

²²Check their significant system requirements first to ensure that you have the disk space and
memory required to run them.


Hadoop 3 


Apache continues to evolve Hadoop. Hadoop 3²³ was released in December of 2017 with
many improvements, including better performance and significantly improved storage
efficiency.²⁴

²³For a list of features in Hadoop 3, see https://hadoop.apache.org/docs/r3.0.0/.

²⁴https://www.datanami.com/2018/10/18/is-hadoop-officially-dead/.


16.5.2 Summarizing Word Lengths in Romeo and Juliet via MapReduce 


In the next several subsections, you'll create a cloud-based, multi-node cluster of computers
using Microsoft Azure HDInsight. Then, you'll use the service's capabilities to demonstrate
Hadoop MapReduce running on that cluster. The MapReduce task you'll define will
determine the length of each word in RomeoAndJuliet.txt (from the “Natural Language
Processing” chapter), then summarize how many words of each length there are. After
defining the task's mapping and reduction steps, you'll submit the task to your HDInsight
cluster, and Hadoop will decide how to use the cluster of computers to perform the task.


16.5.3 Creating an Apache Hadoop Cluster in Microsoft Azure HDInsight 


Most major cloud vendors have support for Hadoop and Spark computing clusters that you
can configure to meet your application's requirements. Multi-node cloud-based clusters
typically are paid services, though most vendors provide free trials or credits so you can try
out their services.


We want you to experience the process of setting up clusters and using them to perform
tasks. So, in this Hadoop example, you'll use Microsoft Azure's HDInsight service to create
cloud-based clusters of computers in which to test our examples. Go to

https://azure.microsoft.com/en-us/free

to sign up for an account. Microsoft requires a credit card for identity verification.

Various services are always free and some you can continue to use for 12 months. For
information on these services see:

https://azure.microsoft.com/en-us/free/free-account-faq/

Microsoft also gives you a credit to experiment with their paid services, such as their
HDInsight Hadoop and Spark services. Once your credits run out or 30 days pass (whichever
comes first), you cannot continue using paid services unless you authorize Microsoft to
charge your card.


Because you'll use your new Azure account's credit for these examples,²⁵ we'll discuss how to
configure a low-cost cluster that uses fewer computing resources than Microsoft allocates by
default.²⁶ Caution: Once you allocate a cluster, it incurs costs whether you're
using it or not. So, when you complete this case study, be sure to delete your
cluster(s) and other resources, so you don't incur additional charges. For more
information, see:

https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-po

²⁵For Microsoft's latest free account features, visit https://azure.microsoft.com/en-us/free/.

²⁶For Microsoft's recommended cluster configurations, see
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning#default-node-configuration-andvirtual-machine-sizes-for-clusters.
If you configure a cluster that's too small for a given scenario, when
you try to deploy the cluster you'll receive an error.

For Azure-related documentation and videos, visit:

• https://docs.microsoft.com/en-us/azure/ (the Azure documentation).

• https://channel9.msdn.com/ (Microsoft's Channel 9 video network).

• https://www.youtube.com/user/windowsazure (Microsoft's Azure channel on
YouTube).


Creating an HDInsight Hadoop Cluster 


The following link explains how to set up a cluster for Hadoop using the Azure HDInsight
service:

https://docs.microsoft.com/en-us/azure/hdinsight/hadoop/apache-hadoop-linux-cre

While following their Create a Hadoop cluster steps, please note the following:

• In Step 1, you access the Azure portal by logging into your account at

https://portal.azure.com

• In Step 2, Data + Analytics is now called Analytics, and the HDInsight icon and icon
color have changed from what is shown in the tutorial.

• In Step 3, you must choose a cluster name that does not already exist. When you enter
your cluster name, Microsoft will check whether that name is available and display a
message if it is not. You must create a password. For the Resource group, you'll also
need to click Create new and provide a group name. Leave all other settings in this step
as is.

• In Step 5, under Select a Storage account, click Create new and provide a storage
account name containing only lowercase letters and numbers. Like the cluster name, the
storage account name must be unique.


When you get to the Cluster summary you'll see that Microsoft initially configures the
cluster as Head (2 x D12 v2), Worker (4 x D4 v2). At the time of this writing, the
estimated cost-per-hour for this configuration was $3.11. This setup uses a total of 6 CPU
nodes with 40 cores, which is far more than we need for demonstration purposes.


You can edit this setup to use fewer CPUs and cores, which also saves money. Let's change
the configuration to a four-node cluster with 16 cores that uses less powerful computers. In
the Cluster summary:

1. Click Edit to the right of Cluster size.

2. Change the Number of Worker nodes to 2.

3. Click Worker node size, then View all, select D3 v2 (this is the minimum CPU size for
Hadoop nodes) and click Select.

4. Click Head node size, then View all, select D3 v2 and click Select.

5. Click Next and click Next again to return to the Cluster summary. Microsoft will
validate the new configuration.

6. When the Create button is enabled, click it to deploy the cluster.


It takes 20-30 minutes for Microsoft to “spin up” your cluster. During this time, Microsoft is
allocating all the resources and software the cluster requires.

After the changes above, our estimated cost for the cluster was $1.18 per hour, based on
average use for similarly configured clusters. Our actual charges were less than that. If you
encounter any problems configuring your cluster, Microsoft provides HDInsight chat-based
support at:

https://azure.microsoft.com/en-us/resources/knowledge-center/technical--chat/








16.5.4 Hadoop Streaming 


For languages like Python that are not natively supported in Hadoop, you must use Hadoop
streaming to implement your tasks. In Hadoop streaming, the Python scripts that
implement the mapping and reduction steps use the standard input stream and
standard output stream to communicate with Hadoop. Usually, the standard input
stream reads from the keyboard and the standard output stream writes to the command line.
However, these can be redirected (as Hadoop does) to read from other sources and write to
other destinations. Hadoop uses the streams as follows:

• Hadoop supplies the input to the mapping script, called the mapper. This script reads
its input from the standard input stream.

• The mapper writes its results to the standard output stream.

• Hadoop supplies the mapper's output as the input to the reduction script, called the
reducer, which reads from the standard input stream.

• The reducer writes its results to the standard output stream.

• Hadoop writes the reducer's output to the Hadoop file system (HDFS).

The mapper and reducer terminology used above should sound familiar to you from our
discussions of functional-style programming and filter, map and reduce in the “Sequences:
Lists and Tuples” chapter.


16.5.5 Implementing the Mapper 


In this section, you'll create a mapper script that takes lines of text as input from Hadoop and
maps them to key-value pairs in which each key is a word's length and its corresponding
value is 1. The mapper sees each word individually so, as far as it is concerned, there's only
one word of each length. In the next section, the reducer will summarize these key-value pairs
by key, reducing the counts to a single count for each key. By default, Hadoop expects the
mapper's output and the reducer's input and output to be in the form of key-value pairs
separated by a tab.

In the mapper script (length_mapper.py), the notation #! in line 1 tells Hadoop to execute
the Python code using python3, rather than the default Python 2 installation. This line must
come before all other comments and code in the file. At the time of this writing, Python 2.7.12
and Python 3.5.2 were installed. Note that because the cluster does not have Python 3.6 or
higher, you cannot use f-strings in your code.


 1 #!/usr/bin/env python3
 2 # length_mapper.py
 3 """Maps lines of text to key-value pairs of word lengths and 1."""
 4 import sys
 5
 6 def tokenize_input():
 7     """Split each line of standard input into a list of strings."""
 8     for line in sys.stdin:
 9         yield line.split()
10
11 # read each line in the standard input and for every word
12 # produce a key-value pair containing the word's length, a tab and 1
13 for line in tokenize_input():
14     for word in line:
15         print(str(len(word)) + '\t1')


Generator function tokenize_input (lines 6-9) reads lines of text from the standard input
stream and for each returns a list of strings. For this example, we are not removing
punctuation or stop words as we did in the “Natural Language Processing” chapter.

When Hadoop executes the script, lines 13-15 iterate through the lists of strings from
tokenize_input. For each list (line) and for every string (word) in that list, line 15
outputs a key-value pair with the word's length as the key, a tab (\t) and the value 1,
indicating that there is one word (so far) of that length. Of course, there probably are many
words of that length. The MapReduce algorithm's reduction step will summarize these
key-value pairs, reducing all those with the same key to a single key-value pair with the total
count.
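
Before uploading a mapper, it can help to try the same logic on a few lines held in memory. The following sketch is a variation for experimentation, not the script above: tokenize_input takes a stream parameter instead of reading sys.stdin directly, map_lengths is a hypothetical helper that collects the lines the mapper would print, and the sample text is invented.

```python
import io

def tokenize_input(stream):
    """Split each line of the stream into a list of strings."""
    for line in stream:
        yield line.split()

def map_lengths(stream):
    """Return the tab-separated key-value lines the mapper would print."""
    return [str(len(word)) + '\t1'
            for line in tokenize_input(stream) for word in line]

# invented input standing in for the text Hadoop supplies on standard input
sample = io.StringIO('But soft, what light\nthrough yonder window breaks?\n')
output = map_lengths(sample)
print(output)
```

Because punctuation is not removed, 'soft,' counts as a five-character word and 'breaks?' as a seven-character word.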


16.5.6 Implementing the Reducer 


In the reducer script (length_reducer.py), function tokenize_input (lines 8-11) is a
generator function that reads and splits the key-value pairs produced by the mapper. Again,
the MapReduce algorithm supplies the standard input. For each line, tokenize_input
strips any leading or trailing whitespace (such as the terminating newline) and yields a list
containing the key and a value.


 1 #!/usr/bin/env python3
 2 # length_reducer.py
 3 """Counts the number of words with each length."""
 4 import sys
 5 from itertools import groupby
 6 from operator import itemgetter
 7
 8 def tokenize_input():
 9     """Split each line of standard input into a key and a value."""
10     for line in sys.stdin:
11         yield line.strip().split('\t')
12
13 # produce key-value pairs of word lengths and counts separated by tabs
14 for word_length, group in groupby(tokenize_input(), itemgetter(0)):
15     try:
16         total = sum(int(count) for word_length, count in group)
17         print(word_length + '\t' + str(total))
18     except ValueError:
19         pass  # ignore word if its count was not an integer


When the MapReduce algorithm executes this reducer, lines 14-19 use the groupby function
from the itertools module to group all word lengths of the same value:

• The first argument calls tokenize_input to get the lists representing the key-value
pairs.

• The second argument indicates that the key-value pairs should be grouped based on the
element at index 0 in each list, which is the key.

Line 16 totals all the counts for a given key. Line 17 outputs a new key-value pair consisting
of the word length and its total. The MapReduce algorithm takes all the final word-length
counts and writes them to a file in HDFS, the Hadoop file system.
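
One detail is worth spelling out: itertools.groupby groups only consecutive items that have the same key, so the reducer depends on Hadoop delivering the mapper's output sorted by key. The sketch below simulates the whole pipeline locally on invented input. The mapper and reducer functions are hypothetical rewrites of the two scripts that take iterables instead of reading standard input, and the call to sorted stands in for the sorting Hadoop performs between the two steps.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Yield a 'length<TAB>1' line for every word in lines."""
    for line in lines:
        for word in line.split():
            yield str(len(word)) + '\t1'

def reducer(lines):
    """Yield 'length<TAB>total' lines from mapper output sorted by key."""
    pairs = (line.strip().split('\t') for line in lines)
    for word_length, group in groupby(pairs, itemgetter(0)):
        total = sum(int(count) for _, count in group)
        yield word_length + '\t' + str(total)

text = ['to be or not to be', 'that is the question']  # invented input

# sorting keeps equal keys adjacent, which is what groupby requires
shuffled = sorted(mapper(text), key=lambda line: int(line.split('\t')[0]))
results = list(reducer(shuffled))
print(results)
```

The keys are sorted numerically here only for readability. Any ordering that keeps equal keys next to one another produces the same totals.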


16.5.7 Preparing to Run the MapReduce Example 


Next, you'll upload files to the cluster so you can execute the example. In a Command
Prompt, Terminal or shell, change to the folder containing your mapper and reducer scripts
and the RomeoAndJuliet.txt file. We assume all three are in this chapter's ch16
examples folder, so be sure to copy your RomeoAndJuliet.txt file to this folder first.


Copying the Script Files to the HDInsight Hadoop Cluster 


Enter the following command to upload the files. Be sure to replace YourClusterName with 
the cluster name you specified when setting up the Hadoop cluster and press Enter only after 
you've typed the entire command. The colon in the following command is required and 
indicates that you'll supply your cluster password when prompted. At that prompt, type the 
password you specified when setting up the cluster, then press Enter: 




scp length_mapper.py length_reducer.py RomeoAndJuliet.txt
   sshuser@YourClusterName-ssh.azurehdinsight.net:


The first time you do this, you'll be asked for security reasons to confirm that you trust the 


target host (that is, Microsoft Azure). 


Copying RomeoAndJuliet into the Hadoop File System


For Hadoop to read the contents of RomeoAndJuliet.txt and supply the lines of text to
your mapper, you must first copy the file into Hadoop’s file system. First, you must use ssh⁷
to log into your cluster and access its command line. In a Command Prompt, Terminal or
shell, execute the following command. Be sure to replace YourClusterName with your cluster
name. Again, you'll be prompted for your cluster password:


7 Windows users: If ssh does not work for you, install and enable it as described at
https://blogs.msdn.microsoft.com/powershell/2017/12/15/using-the-openssh-beta-in-windows-10-fall-creators-update-and-windows-server-1709/.
After completing the installation, log out and log back in or restart your system to
enable ssh.


ssh sshuser@YourClusterName-ssh.azurehdinsight.net


For this example, we'll use the following Hadoop command to copy the text file into the
already existing folder /example/data that the cluster provides for use with Microsoft’s
Azure Hadoop tutorials. Again, press Enter only when you’ve typed the entire command:




hadoop fs -copyFromLocal RomeoAndJuliet.txt 
/example/data/RomeoAndJuliet.txt 


16.5.8 Running the MapReduce Job 


Now you can run the MapReduce job for RomeoAndJuliet.txt on your cluster by
executing the following command. For your convenience, we provided the text of this
command in the file yarn.txt with this example, so you can copy and paste it. We
reformatted the command here for readability:




yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar 
-D mapred.output.key.comparator.class= 
org.apache.hadoop.mapred.lib.KeyFieldBasedComparator 
-D mapred.text.key.comparator.options=-n 
-files length_mapper.py,length_reducer.py 
-mapper length_mapper.py 
-reducer length_reducer.py 
-input /example/data/RomeoAndJuliet.txt 
-output /example/wordlengthsoutput 


The yarn command invokes Hadoop’s YARN (“yet another resource negotiator”) tool to
manage and coordinate access to the Hadoop resources the MapReduce task uses. The file
hadoop-streaming.jar contains the Hadoop streaming utility that allows you to use
Python to implement the mapper and reducer. The two -D options set Hadoop properties
that enable it to sort the final key-value pairs by key (KeyFieldBasedComparator)
numerically (-n, the numeric-sort option) rather than alphabetically. The other
command-line arguments are:


• -files: A comma-separated list of file names. Hadoop copies these files to every node
in the cluster so they can be executed locally on each node.

• -mapper: The name of the mapper’s script file.

• -reducer: The name of the reducer’s script file.

• -input: The file or directory of files to supply as input to the mapper.

• -output: The HDFS directory in which the output will be written. If this folder already
exists, an error will occur.
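The flow that these arguments describe (map, sort by key, reduce) can be imitated on one machine to check the logic before paying for cluster time. This is only a sketch: the mapper's behavior (emitting each word's length with a count of 1) is our assumption, the two input lines are hypothetical, and the sort stands in for Hadoop's shuffle-and-sort step with keys compared numerically in ascending order:

```python
from itertools import groupby
from operator import itemgetter

text = ['But soft what light', 'through yonder window breaks']  # hypothetical input

def mapper(lines):
    """Emit a (word length, 1) pair for every word (assumed mapper behavior)."""
    for line in lines:
        for word in line.split():
            yield (str(len(word)), '1')

# Stand-in for the shuffle-and-sort step: compare the keys as numbers
shuffled = sorted(mapper(text), key=lambda pair: int(pair[0]))

def reducer(pairs):
    """Total the counts for each run of identical keys."""
    for length, group in groupby(pairs, itemgetter(0)):
        yield (length, sum(int(count) for _, count in group))

results = list(reducer(shuffled))
print(results)
```

Comparing the keys as integers matters once word lengths reach two digits: as strings, '10' would sort before '2'.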


The following output shows some of the feedback that Hadoop produces as the MapReduce
job executes. We replaced chunks of the output with ... to save space and bolded several
lines of interest including:


• The total number of “input paths to process”: the 1 source of input in this example is the
RomeoAndJuliet.txt file.


• The “number of splits”: 2 in this example, based on the number of worker nodes in our
cluster.

• The percentage completion information.

• File System Counters, which include the numbers of bytes read and written.

• Job Counters, which show the number of mapping and reduction tasks used and
various timing information.

• Map-Reduce Framework, which shows various information about the steps performed.




packageJobJar: [] [/usr/hdp/2.6.5.3004-13/hadoop-mapreduce/hadoop-streaming-2.


18/12/05 16:46:25 INFO mapred.FileInputFormat: Total input paths to process 
18/12/05 16:46:26 INFO mapreduce.JobSubmitter: number of splits:2 


18/12/05 16:46:26 INFO mapreduce.Job: The url to track the job: ...


18/12/05 16:46:35 INFO mapreduce.Job: map 0% reduce 0% 
18/12/05 16:46:43 INFO mapreduce.Job: map 50% reduce 0% 
18/12/05 16:46:44 INFO mapreduce.Job: map 100% reduce 0% 
18/12/05 16:46:48 INFO mapreduce.Job: map 100% reduce 100% 
18/12/05 16:46:50 INFO mapreduce.Job: Job job_1543953844228_0025 completed suc
18/12/05 16:46:50 INFO mapreduce.Job: Counters: 49 
File System Counters 
FILE: Number of bytes read=156411 
FILE: Number of bytes written=813764 


Job Counters 


Launched map tasks=2 
Launched reduce tasks=1 


Map-Reduce Framework 
Map input records=5260 
Map output records=25956 
Map output bytes=104493 
Map output materialized bytes=156417 
Input split bytes=346 
Combine input records=0 
Combine output records=0 
Reduce input groups=19 
Reduce shuffle bytes=156417 
Reduce input records=25956 
Reduce output records=19 
Spilled Records=51912 
Shuffled Maps =2 
Failed Shuffles=0 
Merged Map outputs=2 
GC time elapsed (ms) =193 
CPU time spent (ms) =4440 
Physical memory (bytes) snapshot=1942798336 
Virtual memory (bytes) snapshot=8463282176 
Total committed heap usage (bytes) =3177185280 





18/12/05 16:46:50 INFO streaming.StreamJob: Output directory: /example/wordler 
Viewing the Word Counts 


Hadoop MapReduce saves its output into HDFS, so to see the actual word counts you must 
look at the file in HDFS within the cluster by executing the following command: 




hdfs dfs -text /example/wordlengthsoutput/part-00000 


Here are the results of the preceding command: 




18/12/05 16:47:19 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library 
18/12/05 16:47:19 INFO lzo.LzoCodec: Successfully loaded & initialized native- 
Deleting Your Cluster So You Do Not Incur Charges 


Caution: Be sure to delete your cluster(s) and associated resources (like 
storage) so you don’t incur additional charges. In the Azure portal, click All 
resources to see your list of resources, which will include the cluster you set up and the 
storage account you set up. Both can incur charges if you do not delete them. Select each 
resource and click the Delete button to remove it. You'll be asked to confirm by typing yes. 


For more information, see: 


https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-po


16.6 SPARK 


In this section, we'll overview Apache Spark. We'll use the Python PySpark library and 
Spark’s functional-style filter/map/reduce capabilities to implement a simple word count 


example that summarizes the word counts in Romeo and Juliet. 


16.6.1 Spark Overview 


When you process truly big data, performance is crucial. Hadoop is geared to disk-based
batch processing: reading the data from disk, processing the data and writing the results
back to disk. Many big-data applications demand better performance than is possible with
disk-intensive operations. In particular, fast streaming applications that require either
real-time or near-real-time processing won’t work in a disk-based architecture.


History 


Spark was initially developed in 2009 at U. C. Berkeley and funded by DARPA (the Defense
Advanced Research Projects Agency). Initially, it was created as a distributed execution
engine for high-performance machine learning.⁸ It uses an in-memory architecture that
“has been used to sort 100 TB of data 3X faster than Hadoop MapReduce on 1/10th of the
machines”⁹ and runs some workloads up to 100 times faster than Hadoop.¹⁰ Spark’s
significantly better performance on batch-processing tasks is leading many companies to
replace Hadoop MapReduce with Spark.¹¹,¹²,¹³


8 https://gigaom.com/2014/06/28/4-reasons-why-spark-could-jolt-hadoop-into-hyperdrive/.


9 https://spark.apache.org/faq.html.


10 https://spark.apache.org/.


11 https://bigdata-madesimple.com/is-spark-better-than-hadoop-map-reduce/.


12 https://www.datanami.com/2018/10/18/is-hadoop-officially-dead/.


13 https://blog.thecodeteam.com/2018/01/09/changing-face-data-analytics-fast-data-displaces-big-data/.


Architecture and Components 


Though it was initially developed to run on Hadoop and use Hadoop components like HDFS 
and YARN, Spark can run standalone on a single computer (typically for learning and testing 
purposes), standalone on a cluster or using various cluster managers and distributed storage 
systems. For resource management, Spark runs on Hadoop YARN, Apache Mesos, Amazon 
EC2 and Kubernetes, and it supports many distributed storage systems, including HDFS, 
Apache Cassandra, Apache HBase and Apache Hive.¹⁴


14 http://spark.apache.org/. 


At the core of Spark are resilient distributed datasets (RDDs), which you'll use to 
process distributed data using functional-style programming. In addition to reading data 
from disk and writing data to disk, Hadoop uses replication for fault tolerance, which adds 
even more disk-based overhead. RDDs eliminate this overhead by remaining in memory— 
using disk only if the data will not fit in memory—and by not replicating data. Spark handles 
fault tolerance by remembering the steps used to create each RDD, so it can rebuild a given 
RDD if a cluster node fails.¹⁵


15 https://spark.apache.org/research.html. 
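The idea of rebuilding data from remembered steps, rather than from stored copies, can be illustrated with a toy class. This is a sketch of the concept only and is not how Spark implements RDDs; LineageDataset and its methods are hypothetical names of our own:

```python
class LineageDataset:
    """Toy stand-in for an RDD: stores the source and the steps, not the results."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def map(self, function):
        """Return a new dataset whose lineage has one more mapping step."""
        return LineageDataset(self.source, self.steps + (('map', function),))

    def filter(self, predicate):
        """Return a new dataset whose lineage has one more filtering step."""
        return LineageDataset(self.source, self.steps + (('filter', predicate),))

    def compute(self):
        """Replay every recorded step, starting from the original source."""
        data = list(self.source)
        for kind, function in self.steps:
            if kind == 'map':
                data = [function(item) for item in data]
            else:
                data = [item for item in data if function(item)]
        return data

squares_of_evens = (LineageDataset(range(10))
                    .filter(lambda x: x % 2 == 0)
                    .map(lambda x: x ** 2))
first = squares_of_evens.compute()
rebuilt = squares_of_evens.compute()  # simulate recomputing a lost result
print(first, rebuilt)
```

Because only the recipe is kept, a lost result costs recomputation time rather than the storage and disk traffic of replicated copies.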


Spark distributes the operations you specify in Python to the cluster’s nodes for parallel
execution. Spark streaming enables you to process data as it’s received. Spark DataFrames,
which are similar to pandas DataFrames, enable you to view RDDs as a collection of
named columns. You can use Spark DataFrames with Spark SQL to perform queries on
distributed data. Spark also includes Spark MLlib (the Spark Machine Learning Library),
which enables you to run machine-learning algorithms, like those you learned in
Chapters 14 and 15. We'll use RDDs, Spark streaming, DataFrames and Spark SQL in the next
few examples.


Providers 


Hadoop providers typically also provide Spark support. In addition to the providers listed in
Section 16.5, there are Spark-specific vendors like Databricks. They provide a “zero-
management cloud platform built around Spark.”¹⁶ Their website also is an excellent resource
for learning Spark. The paid Databricks platform runs on Amazon AWS or Microsoft Azure.
Databricks also provides a free Databricks Community Edition, which is a great way to get
started with both Spark and the Databricks environment.


16 https://databricks.com/product/faq 


16.6.2 Docker and the Jupyter Docker Stacks 


In this section, we'll show how to download and execute a Docker stack containing Spark and 
the PySpark module for accessing Spark from Python. You'll write the Spark example’s code 
in a Jupyter Notebook. First, let’s overview Docker. 


Docker 


Docker is a tool for packaging software into containers (also called images) that bundle 
everything required to execute that software across platforms. Some software packages we 
use in this chapter require complicated setup and configuration. For many of these, there are 


preexisting Docker containers that you can download for free and execute locally on your 
desktop or notebook computers. This makes Docker a great way to help you get started with 


new technologies quickly and conveniently. 


Docker also helps with reproducibility in research and analytics studies. You can create 
custom Docker containers that are configured with the versions of every piece of software and 
every library you used in your study. This would enable others to recreate the environment 
you used, then reproduce your work, and will help you reproduce your results at a later time. 
We'll use Docker in this section to download and execute a Docker container that’s 


preconfigured to run Spark applications. 


Installing Docker 


You can install Docker for Windows 10 Pro or macOS at: 


https://www.docker.com/products/docker-desktop 


On Windows 10 Pro, you must allow the "Docker for Windows.exe" installer to make
changes to your system to complete the installation process. To do so, click Yes when
Windows asks if you want to allow the installer to make changes to your system.¹⁷ Windows
10 Home users must use Virtual Box as described at:


https://docs.docker.com/machine/drivers/virtualbox/


17 Some Windows users might have to follow the instructions under Allow specific apps to
make changes to controlled folders at
https://docs.microsoft.com/en-us/windows/security/threat-protection/windows-defender-exploit-guard/customize-controlled-folders-exploit-guard.


Linux users should install Docker Community Edition as described at: 


https://docs.docker.com/install/overview/ 


For a general overview of Docker, read the Getting started guide at: 


https://docs.docker.com/get-started/ 


Jupyter Docker Stacks 


The Jupyter Notebooks team has preconfigured several Jupyter “Docker stacks” containers 
for common Python development scenarios. Each enables you to use Jupyter Notebooks to 
experiment with powerful capabilities without having to worry about complex software setup 
issues. In each case, you can open JupyterLab in your web browser, open a notebook in 
JupyterLab and start coding. JupyterLab also provides a Terminal window that you can 
use in your browser like your computer’s Terminal, Anaconda Command Prompt or shell. 
Everything we’ve shown you in IPython to this point can be executed using IPython in 
JupyterLab’s Terminal window. 


We'll use the jupyter/pyspark-notebook Docker stack, which is preconfigured with 


everything you need to create and test Apache Spark apps on your computer. When combined 
with installing other Python libraries we’ve used throughout the book, you can implement 
most of this book’s examples using this container. For more about the available Docker 
stacks, visit: 


https://jupyter-docker-stacks.readthedocs.io/en/latest/index.html 


Run Jupyter Docker Stack 


Before performing the next step, ensure that JupyterLab is not currently running on your
computer. Let’s download and run the jupyter/pyspark-notebook Docker stack. To
ensure that you do not lose your work when you close the Docker container, we'll attach a
local file-system folder to the container and use it to save your notebook (Windows users
should replace the line-continuation character \ with ^):




docker run -p 8888:8888 -p 4040:4040 -it --user root \
   -v fullPathToTheFolderYouWantToUse:/home/jovyan/work \
   jupyter/pyspark-notebook:14fdfbf9cfc1 start.sh jupyter lab


The first time you run the preceding command, Docker will download the Docker container 


named: 


jupyter/pyspark-notebook:14fdfbf9cfc1


The notation “:14fdfbf9cfc1” indicates the specific jupyter/pyspark-notebook
container to download. At the time of this writing, 14fdfbf9cfc1 was the newest version of
the container. Specifying the version as we did here helps with reproducibility. Without the
":14fdfbf9cfc1" in the command, Docker will download the latest version of the
container, which might contain different software versions and might not be compatible with
the code you're trying to execute. The Docker container is nearly 6GB, so the initial download
time will depend on your Internet connection’s speed.


Opening JupyterLab in Your Browser 


Once the container is downloaded and running, you'll see a statement in your Command 


Prompt, Terminal or shell window like: 




Copy/paste this URL into your browser when you connect for the first time, to 1 











    http://(bb00eb337630 or 127.0.0.1):8888/?token=
        9570295e90ee94ecef75568b95545b7910a8f5502e6f5680


Copy the long hexadecimal string (the string on your system will differ from this one): 


9570295e90ee94ecef75568b95545b7910a8f5502e6f5680


then open http://localhost:8888/lab in your browser (localhost corresponds to
127.0.0.1 in the preceding output) and paste your token in the Password or token field.
Click Log in to be taken to the JupyterLab interface. If you accidentally close your browser,
go to http://localhost:8888/lab to continue your session.


When running in this Docker container, the work folder in the Files tab at the left side of 
JupyterLab represents the folder you attached to the container in the docker run 
command’s -v option. From here, you can open the notebook files we provide for you. Any 
new notebooks or other files you create will be saved to this folder by default. Because the 
Docker container’s work folder is connected to a folder on your computer, any files you 
create in JupyterLab will remain on your computer, even if you decide to delete the Docker 


container. 


Accessing the Docker Container’s Command Line 


Each Docker container has a command-line interface like the one you’ve used to run IPython
throughout this book. Via this interface, you can install Python packages into the Docker
container and even use IPython as you’ve done previously.


Open a separate Anaconda Command Prompt, Terminal or shell and list the currently 


running Docker containers with the command: 


docker ps 


The output of this command is wide, so the lines of text will likely wrap, as in: 


CONTAINER ID        IMAGE                                   COMMAND
CREATED             STATUS              PORTS
NAMES
f54f62b7e6d5        jupyter/pyspark-notebook:14fdfbf9cfc1   "tini -g -- /bin/bash"
2 minutes ago       Up 2 minutes        0.0.0.0:8888->8888/tcp
friendly_pascal





The last line of our system’s output, which falls under the column head NAMES in the third
line, shows the name that Docker randomly assigned to the running container
(friendly_pascal). The name on your system will differ. To access the container’s command
line, execute the following command, replacing container_name with the running
container’s name:


docker exec -it container_name /bin/bash


The Docker container uses Linux under the hood, so you'll see a Linux prompt where you can 


enter commands. 


The app in this section will use features of the NLTK and TextBlob libraries you used in the
“Natural Language Processing” chapter. Neither is preinstalled in the Jupyter Docker stacks.
To install NLTK and TextBlob enter the command:


conda install -c conda-forge nltk textblob 


Stopping and Restarting a Docker Container 


Every time you start a container with docker run, Docker gives you a new instance that 
does not contain any libraries you installed previously. For this reason, you should keep track 
of your container name, so you can use it from another Anaconda Command Prompt, 
Terminal or shell window to stop the container and restart it. The command 


docker stop container_name


will shut down the container. The command


docker restart container_name


will restart the container. Docker also provides a GUI app called Kitematic that you can use to 
manage your containers, including stopping and restarting them. You can get the app from 
https://kitematic.com/ and access it through the Docker menu. The following user


guide overviews how to manage containers with the tool: 


https://docs.docker.com/kitematic/userguide/ 


16.6.3 Word Count with Spark 


In this section, we'll use Spark’s filtering, mapping and reducing capabilities to implement a
simple word count example that summarizes the words in Romeo and Juliet. You can work
with the existing notebook named RomeoAndJulietCounter.ipynb in the
SparkWordCount folder (into which you should copy your RomeoAndJuliet.txt file from
the “Natural Language Processing” chapter), or you can create a new notebook, then enter
and execute the snippets we show.


Loading the NLTK Stop Words 


In this app, we'll use techniques you learned in the “Natural Language Processing” chapter to
eliminate stop words from the text before counting the words’ frequencies. First, download
the NLTK stop words:


[1]: import nltk
     nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[1]: True


Next, load the stop words: 


[2]: from nltk.corpus import stopwords
     stop_words = stopwords.words('english')


Configuring a SparkContext 


A SparkContext (from module pyspark) object gives you access to Spark’s capabilities in 
Python. Many Spark environments create the SparkContext for you, but in the Jupyter 


pyspark-notebook Docker stack, you must create this object. 


First, let’s specify the configuration options by creating a SparkConf object (from module
pyspark). The following snippet calls the object’s setAppName method to specify the Spark
application’s name and calls the object’s setMaster method to specify the Spark cluster’s
URL. The URL 'local[*]' indicates that Spark is executing on your local computer (as
opposed to a cloud-based cluster), and the asterisk indicates that Spark should run our code
using the same number of threads as there are cores on the computer:


[3]: from pyspark import SparkConf
     configuration = SparkConf().setAppName('RomeoAndJulietCounter')\
                                .setMaster('local[*]')


Threads enable a single node cluster to execute portions of the Spark tasks concurrently to 
simulate the parallelism that Spark clusters provide. When we say that two tasks are 
operating concurrently, we mean that they’re both making progress at once—typically by 
executing a task for a short burst of time, then allowing another task to execute. When we say 
that two tasks are operating in parallel, we mean that they’re executing simultaneously, 
which is one of the key benefits of Hadoop and Spark executing on cloud-based clusters of 


computers. 
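The one-thread-per-core idea can be tried with Python's standard library alone. The sketch below is plain Python rather than Spark, and the two input lines and the count_words helper are hypothetical. In CPython, threads running Python code take turns rather than executing simultaneously, so this illustrates concurrency in the sense described above, not parallelism:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def count_words(line):
    """Return the number of whitespace-separated words in a line."""
    return len(line.split())

lines = ['two households both alike in dignity',
         'in fair verona where we lay our scene']

# One worker thread per core; fall back to 1 if the core count is unknown
worker_count = os.cpu_count() or 1

with ThreadPoolExecutor(max_workers=worker_count) as executor:
    counts = list(executor.map(count_words, lines))  # results keep input order

print(counts)
```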
Next, create the SparkContext, passing the SparkConf as its argument: 


[4]: from pyspark import SparkContext
     sc = SparkContext(conf=configuration)


Reading the Text File and Mapping It to Words 


You work with a SparkContext using functional-style programming techniques, like
filtering, mapping and reduction, applied to a resilient distributed dataset (RDD). An
RDD takes data stored throughout a cluster in the Hadoop file system and enables you to
specify a series of processing steps to transform the data in the RDD. These processing steps
are lazy (Chapter 5): they do not perform any work until you indicate that Spark should
process the task.
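Python's built-in map behaves the same lazy way, which makes the idea easy to observe without Spark. In this minimal sketch (the shout function and the word list are hypothetical), a log records when the work actually happens:

```python
log = []

def shout(word):
    """Record that the function ran, then return the word in uppercase."""
    log.append(word)
    return word.upper()

words = ['romeo', 'juliet', 'nurse']

pipeline = map(shout, words)  # specifies the step, but runs nothing yet
ran_before = list(log)        # snapshot of the log before any work occurs

result = list(pipeline)       # consuming the iterator triggers the work
print(ran_before, result)
```

The snapshot taken before the pipeline is consumed is empty, just as an RDD's steps do nothing until you request results.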


The following snippet specifies three steps: 


• SparkContext method textFile loads the lines of text from RomeoAndJuliet.txt
and returns it as an RDD (from module pyspark) of strings that represent each line.


• RDD method map uses its lambda argument to remove all punctuation with TextBlob’s
strip_punc function and to convert each line of text to lowercase. This method returns
a new RDD on which you can specify additional tasks to perform.


• RDD method flatMap uses its lambda argument to map each line of text into its words
and produces a single list of words, rather than the individual lines of text. The result of
flatMap is a new RDD representing all the words in Romeo and Juliet.


[5]: from textblob.utils import strip_punc
     tokenized = sc.textFile('RomeoAndJuliet.txt')\
                   .map(lambda line: strip_punc(line, all=True).lower())\
                   .flatMap(lambda line: line.split())
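The difference between mapping and flat-mapping is easy to see in plain Python. This sketch uses two hypothetical lines and, as a stand-in for TextBlob's strip_punc, removes ASCII punctuation with str.translate (an assumption made so the example needs only the standard library):

```python
import string
from itertools import chain

lines = ['O Romeo, Romeo!', 'Wherefore art thou Romeo?']

# Stand-in for strip_punc: delete ASCII punctuation, then lowercase
table = str.maketrans('', '', string.punctuation)
cleaned = [line.translate(table).lower() for line in lines]

# Like map: one result per line, so the words stay nested by line
mapped = [line.split() for line in cleaned]

# Like flatMap: the per-line results are flattened into one sequence
flat = list(chain.from_iterable(mapped))

print(mapped)
print(flat)
```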


Removing the Stop Words 


Next, let’s use RDD method filter to create a new RDD with no stop words remaining: 




[6]: filtered = tokenized.filter(lambda word: word not in stop_words) 


Counting Each Remaining Word 


Now that we have only the non-stop-words, we can count the number of occurrences of each
word. To do so, we first map each word to a tuple containing the word and a count of 1. This
is similar to what we did in Hadoop MapReduce. Spark will distribute the reduction task
across the cluster’s nodes. On the resulting RDD, we then call the method reduceByKey,
passing the operator module’s add function as an argument. This tells method
reduceByKey to add the counts for tuples that contain the same word (the key):


[7]: from operator import add
     word_counts = filtered.map(lambda word: (word, 1)).reduceByKey(add)
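What reducing by key computes can be sketched on one machine with a dictionary. The reduce_by_key helper below is hypothetical and is not Spark's implementation, which combines values within each partition and then across nodes; the word list is illustrative:

```python
from operator import add

def reduce_by_key(pairs, function):
    """Combine the values of pairs that share a key, one pair at a time."""
    results = {}
    for key, value in pairs:
        if key in results:
            results[key] = function(results[key], value)
        else:
            results[key] = value
    return results

words = ['romeo', 'juliet', 'romeo', 'nurse', 'romeo', 'juliet']
word_totals = reduce_by_key(((word, 1) for word in words), add)
print(word_totals)
```

Because add is associative and commutative, the totals do not depend on the order in which the pairs are combined, which is what lets a cluster split the work freely.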


Locating Words with Counts Greater Than or Equal to 60 


Since there are hundreds of words in Romeo and Juliet, let’s filter the RDD to keep only those 


words with 60 or more occurrences: 


[8]: filtered_counts = word_counts.filter(lambda item: item[1] >= 60)


Sorting and Displaying the Results 


At this point, we've specified all the steps to count the words. When you call RDD method
collect, Spark initiates all the processing steps we specified above and returns a list
containing the final results, in this case the tuples of words and their counts. From your
perspective, everything appears to execute on one computer. However, if the SparkContext
is configured to use a cluster, Spark will divide the tasks among the cluster’s worker nodes for
you.


The following snippet calls method collect to obtain the results, then sorts the list of tuples
in descending order (reverse=True) by their counts (itemgetter(1)):


[9]: from operator import itemgetter
     sorted_items = sorted(filtered_counts.collect(),
                           key=itemgetter(1), reverse=True)


Finally, let’s display the results. First, we determine the word with the most letters so we can 
right-align all the words in a field of that length, then we display each word and its count: 


[10]: max_len = max([len(word) for word, count in sorted_items])
      for word, count in sorted_items:
          print(f'{word:>{max_len}}: {count}')
   romeo: 298
    thou: 277
  juliet: 178
     thy: 170
   nurse: 146
 capulet: 141
    love: 136
    thee: 135
   shall: 110
    lady: 109
   friar: 104
    come: 94
mercutio: 83
    good: 80
benvolio: 79
   enter: 75
      go: 75
     ill: 71
  tybalt: 69
   death: 69
   night: 68
lawrence: 67
     man: 65
    hath: 64
     one: 60
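The sorting and right-alignment steps work on any list of tuples, so they can be checked without Spark. This sketch uses a small hypothetical list (sample_counts) in place of the collected results:

```python
from operator import itemgetter

# Hypothetical (word, count) tuples standing in for the collected results
sample_counts = [('go', 75), ('romeo', 298), ('enter', 75), ('benvolio', 79)]

# Sort by the count at index 1, largest first
ordered = sorted(sample_counts, key=itemgetter(1), reverse=True)

# Right-align every word in a field as wide as the longest word
width = max(len(word) for word, _ in ordered)
formatted = [f'{word:>{width}}: {count}' for word, count in ordered]
print('\n'.join(formatted))
```

Python's sort is stable even with reverse=True, so words with equal counts keep the relative order they had before sorting.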


16.6.4 Spark Word Count on Microsoft Azure 


As we said previously, we want to expose you to both tools you can use for free and real-world 
development scenarios. In this section, you'll implement the Spark word-count example on a 
Microsoft Azure HDInsight Spark cluster. 


Create an Apache Spark Cluster in HDInsight Using the Azure Portal 


The following link explains how to set up a Spark cluster using the HDInsight service: 


https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spal











While following the Create an HDInsight Spark cluster steps, note the same issues we 
listed in the Hadoop cluster setup earlier in this chapter and for the Cluster type select 
Spark. 


Again, the default cluster configuration provides more resources than you need for our 
examples. So, in the Cluster summary, perform the steps shown in the Hadoop cluster 
setup to change the number of worker nodes to 2 and to configure the worker and head nodes 
to use D3 v2 computers. When you click Create, it takes 20 to 30 minutes to configure and 
deploy your cluster. 


Install Libraries into a Cluster 


If your Spark code requires libraries that are not installed in the HDInsight cluster, you'll 
need to install them. To see what libraries are installed by default, you can use ssh to log into 
your cluster (as we showed earlier in the chapter) and execute the command: 


/usr/bin/anaconda/envs/py35/bin/conda list 


Since your code will execute on multiple cluster nodes, libraries must be installed on every 
node. Azure requires you to create a Linux shell script that specifies the commands to install 
the libraries. When you submit that script to Azure, it validates the script, then executes it on 
every node. Linux shell scripts are beyond this book’s scope, and the script must be hosted on 
a web server from which Azure can download the file. So, we created an install script for you 
that installs the libraries we use in the Spark examples. Perform the following steps to install 
these libraries: 


1. In the Azure portal, select your cluster. 
2. In the list of items under the cluster’s search box, click Script Actions. 


3. Click Submit new to configure the options for the library installation script. For the 
Script type select Custom, for the Name specify Libraries and for the Bash script 
URI use: 


http://deitel.com/bookresources/IntroToPython/install_libraries.sh 





4. Check both Head and Worker to ensure that the script installs the libraries on all the 
nodes. 


5. Click Create. 


When the cluster finishes executing the script, if it executed successfully, you'll see a green 
check next to the script name in the list of script actions. Otherwise, Azure will notify you that 


there were errors. 


Copying RomeoAndJuliet.txt to the HDInsight Cluster 


As you did in the Hadoop demo, let’s use the scp command to upload to the cluster the
RomeoAndJuliet.txt file you used in the “Natural Language Processing” chapter. In a
Command Prompt, Terminal or shell, change to the folder containing the file (we assume this
chapter’s ch16 folder), then enter the following command. Replace YourClusterName with
the name you specified when creating your cluster and press Enter only when you've typed
the entire command. The colon is required and indicates that you'll supply your cluster
password when prompted. At that prompt, type the password you specified when setting up
the cluster, then press Enter:


scp RomeoAndJuliet.txt sshuser@YourClusterName-ssh.azurehdinsight.net: 


Next, use ssh to log into your cluster and access its command line. In a Command Prompt, 
Terminal or shell, execute the following command. Be sure to replace YourClusterName with 


your cluster name. Again, you'll be prompted for your cluster password: 
ssh sshuser@YourClusterName-ssh.azurehdinsight.net 


To work with the RomeoAndJuliet.txt file in Spark, first use the ssh session to copy the
file into the cluster’s Hadoop file system by executing the following command. Once again,
we'll use the already existing folder /example/data that Microsoft includes for use with
HDInsight tutorials. Again, press Enter only when you've typed the entire command:


hadoop fs -copyFromLocal RomeoAndJuliet.txt 
/example/data/RomeoAndJuliet.txt 


Accessing Jupyter Notebooks in HDInsight 


At the time of this writing, HDInsight uses the old Jupyter Notebook interface, rather than 
the newer JupyterLab interface shown earlier. For a quick overview of the old interface see: 


https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Notebook%20


To access Jupyter Notebooks in HDInsight, in the Azure portal select All resources, then 
your cluster. In the Overview tab, select Jupyter notebook under Cluster dashboards. 
This opens a web browser window and asks you to log in. Use the username and password 
you specified when setting up the cluster. If you did not specify a username, the default is 
admin. Once you log in, Jupyter displays a folder containing PySpark and Scala 


subfolders. These contain Python and Scala Spark tutorials. 


Uploading the RomeoAndJulietCounter.ipynb Notebook 


You can create new notebooks by clicking New and selecting PySpark3, or you can upload
existing notebooks from your computer. For this example, let’s upload the previous section’s
RomeoAndJulietCounter.ipynb notebook and modify it to work with Azure. To do so,
click the Upload button, navigate to the ch16 example folder’s SparkWordCount folder,
select RomeoAndJulietCounter.ipynb and click Open. This displays the file in the
folder with an Upload button to its right. Click that button to place the notebook in the
current folder. Next, click the notebook’s name to open it in a new browser tab. Jupyter will
display a Kernel not found dialog. Select PySpark3 and click OK. Do not run any cells
yet.


Modifying the Notebook to Work with Azure 


Perform the following steps, executing each cell as you complete the step: 


1. The HDInsight cluster will not allow NLTK to store the downloaded stop words in NLTK’s
default folder because it’s part of the system’s protected folders. In the first cell, modify
the call nltk.download('stopwords') as follows to store the stop words in the
current folder ('.'):


nltk.download('stopwords', download_dir='.')


When you execute the first cell, Starting Spark application appears below the cell
while HDInsight sets up a SparkContext object named sc for you. When this task is
complete, the cell’s code executes and downloads the stop words.


2. In the second cell, before loading the stop words, you must tell NLTK that they’re located 
in the current folder. Add the following statement after the import statement to tell 
NLTK to search for its data in the current folder: 


nltk.data.path.append('.') 


3. Because HDInsight sets up the SparkContext object for you, the third and fourth cells
of the original notebook are not needed, so you can delete them. To delete a cell, either
click inside it and select Delete Cells from Jupyter’s Edit menu, or click in the white
margin to the cell’s left and type dd.


4. In the next cell, specify the location of RomeoAndJuliet.txt in the underlying Hadoop 
file system. Replace the string 'RomeoAndJuliet.txt' with the string 


'wasb:///example/data/RomeoAndJuliet.txt' 


The notation wasb:/// indicates that RomeoAndJuliet.txt is stored in a Windows
Azure Storage Blob (WASB), which is Azure’s interface to the HDFS file system.


5. Because Azure currently uses Python 3.5.x, it does not support f-strings. So, in the last 
cell, replace the f-string with the following older-style Python string formatting using the 
string method format: 


print('{:>{width}}: {}'.format(word, count, width=max_len))


You'll see the same final results as in the previous section. 
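As a quick check of step 5, the following sketch compares an f-string with the equivalent call to the string method format. The sample values for word, count and max_len are made up for illustration, and the f-string shown is our assumption about the form of the statement in the original cell.

```python
# Illustrative values only; in the notebook these come from the word counts.
word, count, max_len = 'romeo', 298, 8

# f-string form (requires Python 3.6 or later)
new_style = f'{word:>{max_len}}: {count}'

# str.format form, which also runs on Python 3.5.x
old_style = '{:>{width}}: {}'.format(word, count, width=max_len)

print(old_style)
print(new_style == old_style)
```

Both forms right-align the word in a field of max_len characters, so either one produces the same report line.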


Caution: Be sure to delete your cluster and other resources when you’re done
with them, so you do not incur charges. For more information, see:


https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-po


Note that when you delete your Azure resources, your notebooks will be deleted as well. You
can download the notebook you just executed by selecting File > Download as >
Notebook (.ipynb) in Jupyter.


16.7 SPARK STREAMING: COUNTING TWITTER HASHTAGS 
USING THE PYSPARK-NOTEBOOK DOCKER STACK 


In this section, you’ll create and run a Spark streaming application in which you’ll receive a
stream of tweets on the topic(s) you specify and summarize the top-20 hashtags in a bar chart
that updates every 10 seconds. For the purpose of this example, you’ll use the Jupyter Docker
container from the first Spark example.


There are two parts to this example. First, using the techniques from the “Data Mining 
Twitter” chapter, you'll create a script that streams tweets from Twitter. Then, we'll use Spark 
streaming in a Jupyter Notebook to read the tweets and summarize the hashtags. 


The two parts will communicate with one another via networking sockets—a low-level view 
of client/server networking in which a client app communicates with a server app over a 
network using techniques similar to file I/O. A program can read from a socket or write to a 
socket similarly to reading from a file or writing to a file. The socket represents one endpoint 
of a connection. In this case, the client will be a Spark application, and the server will be a 
script that receives streaming tweets and sends them to the Spark app. 
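The following minimal sketch illustrates the socket idea on a single computer. It is not part of the book's example: the serve_lines function, the sample hashtag lines and the use of an operating-system-chosen port are all assumptions made for illustration. A server thread sends lines of text as bytes, and a client reads them back much as it would read lines from a file.

```python
import socket
import threading

HOST = '127.0.0.1'  # the loopback address, i.e., this computer


def serve_lines(server_socket, lines):
    """Accept one client, send each line as UTF-8 bytes, then close."""
    connection, _ = server_socket.accept()  # blocks until a client connects
    with connection:
        for line in lines:
            connection.sendall((line + '\n').encode('utf-8'))


# server side: bind to an unused port chosen by the operating system
server_socket = socket.socket()
server_socket.bind((HOST, 0))
server_socket.listen()
port = server_socket.getsockname()[1]

lines = ['#python #spark', '#bigdata']
server_thread = threading.Thread(target=serve_lines,
                                 args=(server_socket, lines))
server_thread.start()

# client side: connect, then read from the socket like reading a text file
with socket.create_connection((HOST, port)) as client_socket:
    with client_socket.makefile('r', encoding='utf-8') as reader:
        received = [line.rstrip('\n') for line in reader]

server_thread.join()
server_socket.close()
print(received)
```

In this section's example the roles are the same, but the server is the tweet-streaming script and the client is the Spark application.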


Launching the Docker Container and Installing Tweepy 


For this example, you’ll install the Tweepy library into the Jupyter Docker container. Follow
Section 16.6.2’s instructions for launching the container and installing Python libraries into
it. Use the following command to install Tweepy:


pip install tweepy 


16.7.1 Streaming Tweets to a Socket 


The script starttweetstream.py contains a modified version of the TweetListener
class from the “Data Mining Twitter” chapter. It streams the specified number of tweets and
sends them to a socket on the local computer. When the tweet limit is reached, the script
closes the socket. You’ve already used Twitter streaming, so we’ll focus only on what’s new.
Ensure that the file keys.py (in the ch16 folder’s SparkHashtagSummarizer subfolder)
contains your Twitter credentials.


Executing the Script in the Docker Container 


In this example, you’ll use JupyterLab’s Terminal window to execute
starttweetstream.py in one tab, then use a notebook to perform the Spark task in
another tab. With the Jupyter pyspark-notebook Docker container running, open


http://localhost:8888/lab


in your web browser. In JupyterLab, select File > New > Terminal to open a new tab
containing a Terminal. This is a Linux-based command line. Typing the ls command and
pressing Enter lists the current folder’s contents. By default, you’ll see the container’s work
folder.


To execute starttweetstream.py, you must first navigate to the
SparkHashtagSummarizer folder with the command:⁸


cd work/SparkHashtagSummarizer


⁸ Windows users should note that Linux uses / rather than \ to separate folders and that file
and folder names are case sensitive.

You can now execute the script with a command of the form


ipython starttweetstream.py number_of_tweets search_terms


where number_of_tweets specifies the total number of tweets to process and search_terms
specifies one or more space-separated strings to use for filtering tweets. For example, the
following command would stream 1000 tweets about football:


ipython starttweetstream.py 1000 football 


At this point, the script will display "Waiting for connection" and will wait until Spark 


connects to begin streaming the tweets. 


starttweetstream.py import Statements 


For discussion purposes, we’ve divided starttweetstream.py into pieces. First, we
import the modules used in the script. The Python Standard Library’s socket module
provides the capabilities that enable Python apps to communicate via sockets.


1 # starttweetstream.py
2 """Script to get tweets on topic(s) specified as script argument(s)
3    and send tweet text to a socket for processing by Spark."""
4 import keys
5 import socket
6 import sys
7 import tweepy
8


Class TweetListener 


Once again, you’ve seen most of the code in class TweetListener, so we focus only on
what’s new here:


• Method __init__ (lines 12–17) now receives a connection parameter representing the
socket and stores it in the self.connection attribute. We use this socket to send the
hashtags to the Spark application.


• In method on_status (lines 24–44), lines 27–32 extract the hashtags from the Tweepy
Status object, convert them to lowercase and create a space-separated string of the
hashtags to send to Spark. The key statement is line 39:


self.connection.send(hashtags_string.encode('utf-8'))


which uses the connection object’s send method to send the tweet text to whatever
application is reading from that socket. Method send expects as its argument a sequence of
bytes. The string method call encode('utf-8') converts the string to bytes. Spark will
automatically read the bytes and reconstruct the strings.


 9 class TweetListener(tweepy.StreamListener):
10     """Handles incoming Tweet stream."""
11
12     def __init__(self, api, connection, limit=10000):
13         """Create instance variables for tracking number of tweets."""
14         self.connection = connection
15         self.tweet_count = 0
16         self.TWEET_LIMIT = limit  # 10,000 by default
17         super().__init__(api)  # call superclass's init
18
19     def on_connect(self):
20         """Called when your connection attempt is successful, enabling
21         you to perform appropriate application tasks at that point."""
22         print('Successfully connected to Twitter\n')
23
24     def on_status(self, status):
25         """Called when Twitter pushes a new tweet to you."""
26         # get the hashtags
27         hashtags = []
28
29         for hashtag_dict in status.entities['hashtags']:
30             hashtags.append(hashtag_dict['text'].lower())
31
32         hashtags_string = ' '.join(hashtags) + '\n'
33         print(f'Screen name: {status.user.screen_name}:')
34         print(f'    Hashtags: {hashtags_string}')
35         self.tweet_count += 1  # track number of tweets processed
36
37         try:
38             # send requires bytes, so encode the string in utf-8 format
39             self.connection.send(hashtags_string.encode('utf-8'))
40         except Exception as e:
41             print(f'Error: {e}')
42
43         # if TWEET_LIMIT is reached, return False to terminate streaming
44         return self.tweet_count != self.TWEET_LIMIT
45
46     def on_error(self, status):
47         print(status)
48         return True
49

Main Application 


Lines 50–80 execute when you run the script. You’ve connected to Twitter to stream tweets
previously, so here we discuss only what’s new in this example.


Line 51 gets the number of tweets to process by converting the command-line argument
sys.argv[1] to an integer. Recall that element 0 represents the script’s name.


50 if __name__ == '__main__':
51     tweet_limit = int(sys.argv[1])  # get maximum number of tweets
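As a small illustration of how the command-line arguments are laid out, the sketch below separates an argv-style list into the tweet limit and the search terms. The parse_arguments helper is hypothetical and is not part of starttweetstream.py. It simply mirrors the indexing that the script applies to sys.argv.

```python
def parse_arguments(argv):
    """Split an argv-style list into the tweet limit and the search terms."""
    tweet_limit = int(argv[1])  # argv[0] is the script's name
    search_terms = argv[2:]     # every remaining argument is a search term
    return tweet_limit, search_terms


# the list below mimics: ipython starttweetstream.py 1000 football
print(parse_arguments(['starttweetstream.py', '1000', 'football']))
```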


Line 52 calls the socket module’s socket function, which returns a socket object that 


we'll use to wait for a connection from the Spark application. 


52     client_socket = socket.socket()  # create a socket
53


Line 55 calls the socket object’s bind method with a tuple containing the hostname or IP 
address of the computer and the port number on that computer. Together these represent 


where this script will wait for an initial connection from another app: 


54     # app will use localhost (this computer) port 9876
55     client_socket.bind(('localhost', 9876))
56


Line 58 calls the socket’s listen method, which puts the socket into listening mode so that
it can receive connection requests. Together with the accept call that follows, this prevents
the Twitter stream from starting until the Spark application connects.


57     print('Waiting for connection')
58     client_socket.listen()  # listen for a client's connection request
59


Once the Spark application connects, line 61 calls socket method accept, which accepts the 
connection. This method returns a tuple containing a new socket object that the script will 
use to communicate with the Spark application and the IP address of the Spark application’s 


computer. 


60     # when connection received, get connection/client address
61     connection, address = client_socket.accept()
62     print(f'Connection received from {address}')
63


Next, we authenticate with Twitter and start the stream. Lines 73–74 set up the stream,
passing the socket object connection to the TweetListener so that it can use the socket
to send hashtags to the Spark application.


64     # configure Twitter access
65     auth = tweepy.OAuthHandler(keys.consumer_key, keys.consumer_secret)
66     auth.set_access_token(keys.access_token, keys.access_token_secret)
67
68     # configure Tweepy to wait if Twitter rate limits are reached
69     api = tweepy.API(auth, wait_on_rate_limit=True,
70                      wait_on_rate_limit_notify=True)
71
72     # create the Stream
73     twitter_stream = tweepy.Stream(api.auth,
74         TweetListener(api, connection, tweet_limit))
75
76     # sys.argv[2] is the first search term
77     twitter_stream.filter(track=sys.argv[2:])
78

Finally, lines 79–80 call the close method on the socket objects to release their resources.


79     connection.close()
80     client_socket.close()


16.7.2 Summarizing Tweet Hashtags; Introducing Spark SQL 


In this section, you'll use Spark streaming to read the hashtags sent via a socket by the script 


starttweetstream.py and summarize the results. You can either create a new notebook 


and enter the code you see here or load the hashtagsummarizer.ipynb notebook we 


provide in the ch16 examples folder’s SparkHashtagSummarizer subfolder. 


Importing the Libraries 


First, let’s import the libraries used in this notebook. We’ll explain the pyspark classes as we
use them. From IPython, we import the display module, which contains classes and
utility functions that you can use in Jupyter. In particular, we’ll use the clear_output
function to remove an existing chart before displaying a new one:


[1]: from pyspark import SparkContext
     from pyspark.streaming import StreamingContext
     from pyspark.sql import Row, SparkSession
     from IPython import display
     import matplotlib.pyplot as plt
     import seaborn as sns
     %matplotlib inline


This Spark application summarizes hashtags in 10-second batches. After processing each
batch, it displays a Seaborn barplot. The IPython magic


%matplotlib inline


indicates that Matplotlib-based graphics should be displayed in the notebook rather than in
their own windows. Recall that Seaborn uses Matplotlib.


We’ve used several IPython magics throughout the book. There are many magics specifically
for use in Jupyter Notebooks. For the complete list of magics see:


https://ipython.readthedocs.io/en/stable/interactive/magics.html


Utility Function to Get the SparkSession 


As you’ll soon see, you can use Spark SQL to query data in resilient distributed datasets
(RDDs). Spark SQL uses a Spark DataFrame to get a table view of the underlying RDDs. A
SparkSession (module pyspark.sql) is used to create a DataFrame from an RDD.


There can be only one SparkSession object per Spark application. The following function,
which we borrowed from the Spark Streaming Programming Guide,* defines the correct
way to get a SparkSession instance if it already exists or to create one if it does not yet
exist:


* Because this function was borrowed from the Spark Streaming Programming Guide’s
DataFrame and SQL Operations section
(https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations),
we did not rename it to use Python’s standard function naming style, and we did not use
single quotes to delimit strings.


[2]: def getSparkSessionInstance(sparkConf):
         """Spark Streaming Programming Guide's recommended method
            for getting an existing SparkSession or creating a new one."""
         if ("sparkSessionSingletonInstance" not in globals()):
             globals()["sparkSessionSingletonInstance"] = SparkSession \
                 .builder \
                 .config(conf=sparkConf) \
                 .getOrCreate()
         return globals()["sparkSessionSingletonInstance"]
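The get-or-create idea behind this function does not depend on Spark. The sketch below shows the same pattern with a plain dictionary standing in for the SparkSession. The names get_singleton and demo_singleton_instance are hypothetical and are used only for illustration.

```python
def get_singleton(name, factory):
    """Return the object cached under name in globals(), creating it once."""
    if name not in globals():
        globals()[name] = factory()  # first call: create and cache the object
    return globals()[name]           # later calls: reuse the cached object


first = get_singleton('demo_singleton_instance', dict)
second = get_singleton('demo_singleton_instance', dict)
print(first is second)
```

Because the second call finds the cached object, both names refer to the same instance, which is the behavior the notebook relies on when the function is called once per batch.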


Utility Function to Display a Barchart Based on a Spark DataFrame 


We call function display_barplot after Spark processes each batch of hashtags. Each call
clears the previous Seaborn barplot, then displays a new one based on the Spark DataFrame
it receives. First, we call the Spark DataFrame’s toPandas method to convert it to a pandas
DataFrame for use with Seaborn. Next, we call the clear_output function from the
IPython.display module. The keyword argument wait=True indicates that the function
should remove the prior graph (if there is one), but only once the new graph is ready to
display. The rest of the code in the function uses standard Seaborn techniques we’ve shown
previously. The function call sns.color_palette('cool', 20) selects twenty colors
from the Matplotlib 'cool' color palette:


[3]: def display_barplot(spark_df, x, y, time, scale=2.0, size=(16, 9)):
         """Displays a Spark DataFrame's contents as a bar plot."""
         df = spark_df.toPandas()

         # remove prior graph when new one is ready to display
         display.clear_output(wait=True)
         print(f'TIME: {time}')

         # create and configure a Figure containing a Seaborn barplot
         plt.figure(figsize=size)
         sns.set(font_scale=scale)
         barplot = sns.barplot(data=df, x=x, y=y,
                               palette=sns.color_palette('cool', 20))

         # rotate the x-axis labels 90 degrees for readability
         for item in barplot.get_xticklabels():
             item.set_rotation(90)

         plt.tight_layout()
         plt.show()


Utility Function to Summarize the Top-20 Hashtags So Far 


In Spark streaming, a DStream is a sequence of RDDs each representing a mini-batch of data
to process. As you’ll soon see, you can specify a function that is called to perform a task for
every RDD in the stream. In this app, the function count_tags will summarize the hashtag
counts in a given RDD, add them to the current totals (maintained by the SparkSession),
then display an updated top-20 barplot so that we can see how the top-20 hashtags are
changing over time.* For discussion purposes, we’ve broken this function into smaller pieces.
First, we get the SparkSession by calling the utility function
getSparkSessionInstance with the SparkContext’s configuration information. Every
RDD has access to the SparkContext via the context attribute:


* When this function gets called the first time, you might see an exception’s error message
display if no tweets with hashtags have been received yet. This is because we simply display
the error message in the standard output. That message will disappear as soon as there are
tweets with hashtags.


[4]: def count_tags(time, rdd):
         """Count hashtags and display top-20 in descending order."""
         try:
             # get SparkSession
             spark = getSparkSessionInstance(rdd.context.getConf())


Next, we call the RDD’s map method to map the data in the RDD to Row objects (from the
pyspark.sql package). The RDDs in this example contain tuples of hashtags and counts.
The Row constructor uses the names of its keyword arguments to specify the column names
for each value in that row. In this case, tag[0] is the hashtag in the tuple, and tag[1] is the
total count for that hashtag:


             # map hashtag string-count tuples to Rows
             rows = rdd.map(
                 lambda tag: Row(hashtag=tag[0], total=tag[1]))


The next statement creates a Spark DataFrame containing the Row objects. We'll use this 


with Spark SQL to query the data to get the top-20 hashtags with their total counts: 


             # create a DataFrame from the Row objects
             hashtags_df = spark.createDataFrame(rows)


To query a Spark DataFrame, first create a table view, which enables Spark SQL to query the 
DataFrame like a table in a relational database. Spark DataFrame method 
createOrReplaceTempView creates a temporary table view for the DataFrame and 


names the view for use in the from clause of a query: 


             # create a temporary table view for use with Spark SQL
             hashtags_df.createOrReplaceTempView('hashtags')


Once you have a table view, you can query the data using Spark SQL.* The following
statement uses the SparkSession instance’s sql method to perform a Spark SQL query
that selects the hashtag and total columns from the hashtags table view, orders the
selected rows by total in descending (desc) order, then returns the first 20 rows of the
result (limit 20). Spark SQL returns a new Spark DataFrame containing the results:


* For details of Spark SQL’s syntax, see https://spark.apache.org/sql/.


             # use Spark SQL to get top 20 hashtags in descending order
             top20_df = spark.sql(
                 """select hashtag, total
                    from hashtags
                    order by total desc, hashtag
                    limit 20""")


Finally, we pass the Spark DataFrame to our display_barplot utility function. The
hashtags and totals will be displayed on the x- and y-axes, respectively. We also display the
time at which count_tags was called:


             display_barplot(top20_df, x='hashtag', y='total', time=time)
         except Exception as e:
             print(f'Exception: {e}')
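To experiment with the shape of the top-N query without a Spark cluster, you can run a similar statement against an in-memory SQLite database. This sketch is only an analogy and rests on several assumptions: it uses the Python Standard Library's sqlite3 module as a stand-in for Spark SQL, the sample hashtag totals are made up, and it limits the result to two rows so the output stays short.

```python
import sqlite3

connection = sqlite3.connect(':memory:')  # temporary in-memory database
connection.execute('create table hashtags (hashtag text, total integer)')
connection.executemany(
    'insert into hashtags values (?, ?)',
    [('nfl', 7), ('soccer', 12), ('football', 30), ('epl', 3)])

# largest totals first; ties would be broken alphabetically by hashtag
top2 = connection.execute(
    """select hashtag, total
       from hashtags
       order by total desc, hashtag
       limit 2""").fetchall()
connection.close()
print(top2)
```

Note that desc applies only to the column it follows, so placing it after total is what puts the most frequent hashtags at the top of the result.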











Getting the SparkContext 


The rest of the code in this notebook sets up Spark streaming to read text from the 
starttweetstream.py script and specifies how to process the tweets. First, we create the 


SparkContext for connecting to the Spark cluster: 


[5]: sc = SparkContext () 


Getting the StreamingContext 


For Spark streaming, you must create a StreamingContext (module 
pyspark.streaming), providing as arguments the SparkContext and how often in 
seconds to process batches of streaming data. In this app, we'll process batches every 10 


seconds—this is the batch interval: 




[6]: ssc = StreamingContext(sc, 10) 


Depending on how fast data is arriving, you may wish to shorten or lengthen your batch 
intervals. For a discussion of this and other performance-related issues, see the Performance 
Tuning section of the Spark Streaming Programming Guide: 


https://spark.apache.org/docs/latest/streaming-programming-guide.html#performan


Setting Up a Checkpoint for Maintaining State


By default, Spark streaming does not maintain state information as you process the stream of 
RDDs. However, you can use Spark checkpointing to keep track of the streaming state. 
Checkpointing enables: 


• fault-tolerance for restarting a stream in cases of cluster node or Spark application
failures, and


• stateful transformations, such as summarizing the data received so far, as we’re doing in
this example.


StreamingContext method checkpoint sets up the checkpointing folder: 


[7]: ssc.checkpoint('hashtagsummarizer_checkpoint')


For a Spark streaming application in a cloud-based cluster, you’d specify a location within
HDFS to store the checkpoint folder. We’re running this example in the local Jupyter Docker
image, so we simply specified the name of a folder, which Spark will create in the current
folder (in our case, the ch16 folder’s SparkHashtagSummarizer). For more details on
checkpointing, see


https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpoin


Connecting to the Stream via a Socket


StreamingContext method socketTextStream connects to a socket from which a
stream of data will be received and returns a DStream that receives the data. The method’s
arguments are the hostname and port number to which the StreamingContext should
connect. These must match where the starttweetstream.py script is waiting for the
connection:



[8]: stream = ssc.socketTextStream('localhost', 9876) 


Tokenizing the Lines of Hashtags 


We use functional-style programming calls on a DStream to specify the processing steps to
perform on the streaming data. The following call to DStream’s flatMap method tokenizes
a line of space-separated hashtags and returns a new DStream representing the individual
tags:



[9]: tokenized = stream.flatMap(lambda line: line.split()) 


Mapping the Hashtags to Tuples of Hashtag-Count Pairs 


Next, similar to the Hadoop mapper earlier in this chapter, we use DStream method map to
get a new DStream in which each hashtag is mapped to a hashtag-count pair (in this case as a
tuple) in which the count is initially 1:



[10]: mapped = tokenized.map(lambda hashtag: (hashtag, 1)) 


Totaling the Hashtag Counts So Far 


DStream method updateStateByKey receives a two-argument lambda that totals the 


counts for a given key and adds them to the prior total for that key: 


[11]: hashtag_counts = mapped.updateStateByKey(
          lambda counts, prior_total: sum(counts) + (prior_total or 0))
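The three processing steps can be mimicked in plain Python to see how the running totals accumulate from one batch to the next. This is a simplified sketch and not how Spark executes the stream: the update_totals function and the sample batches are assumptions for illustration, and an ordinary dictionary plays the role of the state that Spark maintains via checkpointing.

```python
from collections import defaultdict


def update_totals(batches):
    """Mimic flatMap, map and updateStateByKey over a list of batches."""
    state = {}  # running totals across all batches processed so far
    for batch in batches:
        # flatMap: split every line into individual hashtags
        tokenized = [tag for line in batch for tag in line.split()]
        # map: pair each hashtag with an initial count of 1
        mapped = [(tag, 1) for tag in tokenized]
        # group this batch's counts by hashtag
        counts_by_key = defaultdict(list)
        for tag, count in mapped:
            counts_by_key[tag].append(count)
        # updateStateByKey: add this batch's counts to the prior total
        for tag, counts in counts_by_key.items():
            prior_total = state.get(tag)  # None if the tag is new
            state[tag] = sum(counts) + (prior_total or 0)
    return state


totals = update_totals([['nfl football', 'football'], ['football soccer']])
print(totals)
```

The expression (prior_total or 0) handles a hashtag's first appearance, when there is no prior total yet.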


Specifying the Method to Call for Every RDD 


Finally, we use DStream method foreachRDD to specify that every processed RDD should be
passed to function count_tags, which then summarizes the top-20 hashtags so far and
displays a barplot:



[12]: hashtag_counts.foreachRDD(count_tags) 


Starting the Spark Stream 


Now that we’ve specified the processing steps, we call the StreamingContext’s start
method to connect to the socket and begin the streaming process.

[13]: ssc.start() # start the Spark streaming 


The following shows a sample barplot produced while processing a stream of tweets about
“football.” Because football is a different sport in the United States than in the rest of the
world, the hashtags relate to both American football and what we call soccer. We grayed out
three hashtags that were not appropriate for publication:


16.8 INTERNET OF THINGS AND DASHBOARDS


In the late 1960s, the Internet began as the ARPANET, which initially connected four
universities and grew to 10 nodes by the end of 1970.³ In the last 50 years, that has grown to
billions of computers, smartphones, tablets and an enormous range of other device types
connected to the Internet worldwide. Any device connected to the Internet is a “thing” in the
Internet of Things (IoT).


³ https://en.wikipedia.org/wiki/ARPANET#History.


Each device has a unique Internet protocol address (IP address) that identifies it. The
explosion of connected devices exhausted the approximately 4.3 billion available IPv4
(Internet Protocol version 4) addresses⁴ and led to the development of IPv6, which supports
approximately 3.4×10³⁸ addresses (that’s a lot of zeros).⁵


⁴ https://en.wikipedia.org/wiki/IPv4_address_exhaustion.
⁵ https://en.wikipedia.org/wiki/IPv6.
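The two address-space sizes follow directly from the address widths: an IPv4 address is 32 bits and an IPv6 address is 128 bits. The short calculation below confirms the approximate figures. The variable names are ours.

```python
ipv4_addresses = 2 ** 32    # IPv4 addresses are 32 bits wide
ipv6_addresses = 2 ** 128   # IPv6 addresses are 128 bits wide

print(f'{ipv4_addresses:,}')    # exact count with thousands separators
print(f'{ipv6_addresses:.1e}')  # scientific notation, two significant digits
```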


“Top research firms such as Gartner and McKinsey predict a jump from the 6 billion
connected devices we have worldwide today, to 20-30 billion by 2020.”⁶ Various predictions
say that number could be 50 billion. Computer-controlled, Internet-connected devices
continue to proliferate. The following is a small subset of IoT device types and applications.


⁶ https://www.pubnub.com/developers/tech/how-pubnub-works/.


IoT devices


• activity trackers: Apple Watch, FitBit
• Amazon Dash ordering buttons
• Amazon Echo (Alexa), Apple HomePod (Siri), Google Home (Google Assistant)
• appliances: ovens, coffee makers, refrigerators
• driverless cars
• earthquake sensors
• healthcare: blood glucose monitors for diabetics, blood pressure monitors,
  electrocardiograms (EKG/ECG), electroencephalograms (EEG), heart monitors, ingestible
  sensors, pacemakers, sleep trackers
• sensors: chemical, gas, GPS, humidity, light, motion, pressure, temperature
• smart home: lights, garage openers, video cameras, doorbells, irrigation controllers,
  security devices, smart locks, smart plugs, smoke detectors, thermostats, air vents
• tsunami sensors
• tracking devices
• wine cellar refrigerators
• wireless network devices


IoT Issues


Though there’s a lot of excitement and opportunity in IoT, not everything is positive. There
are many security, privacy and ethical concerns. Unsecured IoT devices have been used to
perform distributed-denial-of-service (DDoS) attacks on computer systems.⁷ Home security
cameras that you intend to protect your home could potentially be hacked to allow others
access to the video stream. Voice-controlled devices are always “listening” to hear their
trigger words. This leads to privacy and security concerns. Children have accidentally ordered
products on Amazon by talking to Alexa devices, and companies have created TV ads that
would activate Google Home devices by speaking their trigger words and causing Google
Assistant to read Wikipedia pages about a product to you.⁸ Some people worry that these
devices could be used to eavesdrop. Just recently, a judge ordered Amazon to turn over Alexa
recordings for use in a criminal case.⁹


⁷ https://threatpost.com/iot-security-concerns-peaking-with-no-end-in-sight/131308/.
⁸ https://www.symantec.com/content/dam/symantec/docs/security-center/white-papers/istr-security-voice-activated-smart-speakers-en.pdf.
⁹ https://techcrunch.com/2018/11/14/amazon-echo-recordings-judge-murder-case/.


This Section’s Examples 


In this section, we discuss the publish/subscribe model that IoT and other types of 
applications use to communicate. First, without writing any code, you'll build a web-based 
dashboard using Freeboard.io and subscribe to a sample live stream from the PubNub 
service. Next, you'll simulate an Internet-connected thermostat which publishes messages to 
the free Dweet.io service using the Python module Dweepy, then create a dashboard 
visualization of it with Freeboard.io. Finally, you'll build a Python client that subscribes to a 
sample live stream from the PubNub service and dynamically visualizes the stream with 
Seaborn and a Matplotlib FuncAnimation. 


16.8.1 Publish and Subscribe 


IoT devices (and many other types of devices and applications) commonly communicate with 
one another and with applications via pub/sub (publisher/subscriber) systems. A 
publisher is any device or application that sends a message to a cloud-based service, which 
in turn sends that message to all subscribers. Typically each publisher specifies a topic or 
channel, and each subscriber specifies one or more topics or channels for which they'd like 
to receive messages. There are many pub/sub systems in use today. In the remainder of this 
section, we'll use PubNub and Dweet.io. You also should investigate Apache Kafka—a 
Hadoop ecosystem component that provides a high-performance publish/subscribe service, 


real-time stream processing and storage of streamed data. 
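The publish/subscribe flow can be sketched in a few lines of Python. The Broker class below is a toy, in-memory stand-in for a cloud-based pub/sub service. It is not the PubNub or Dweet.io API, and the channel names and messages are made up for illustration.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory stand-in for a cloud-based pub/sub service."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # channel name -> callbacks

    def subscribe(self, channel, callback):
        """Register a callback to receive a channel's messages."""
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        """Deliver a message to every subscriber of the channel."""
        for callback in self.subscribers[channel]:
            callback(message)


broker = Broker()
dashboard, logger = [], []  # each subscriber just stores what it receives
broker.subscribe('thermostat', dashboard.append)
broker.subscribe('thermostat', logger.append)
broker.subscribe('doorbell', logger.append)

broker.publish('thermostat', {'temperature': 72})
broker.publish('doorbell', {'pressed': True})
print(dashboard)
print(logger)
```

The publisher never refers to its subscribers directly. It names only a channel, which is what lets devices and applications be added or removed independently.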


16.8.2 Visualizing a PubNub Sample Live Stream with a Freeboard 
Dashboard 


PubNub is a pub/sub service geared to real-time applications in which any software and
device connected to the Internet can communicate via small messages. Some of their
common use-cases include IoT, chat, online multiplayer games, social apps and collaborative
apps. PubNub provides several live streams for learning purposes, including one that
simulates IoT sensors (Section 16.8.5 lists the others).


One common use of live data streams is visualizing them for monitoring purposes. In this 
section, you'll connect PubNub’s live simulated sensor stream to a Freeboard.io web-based 
dashboard. A car’s dashboard visualizes data from your car’s sensors, showing information 
such as the outside temperature, your speed, engine temperature, the time and the amount of 
gas remaining. A web-based dashboard does the same thing for data from various sources, 
including IoT devices. 


Freeboard.io is a cloud-based dynamic dashboard visualization tool. You'll see that, without 
writing any code, you can easily connect Freeboard.io to various data streams and visualize 
the data as it arrives. The following dashboard visualizes data from three of the four 
simulated sensors in the PubNub simulated IoT sensors stream: 


[Figure: Freeboard.io dashboard with a Gauge and a Sparkline for each of three sensors] 


For each sensor, we used a Gauge (the semicircular visualizations) and a Sparkline (the 
jagged lines) to visualize the data. When you complete this section, you'll see the Gauges and 
Sparklines frequently moving as new data arrives multiple times per second. 


In addition to their paid service, Freeboard.io provides an open-source version (with fewer 
options) on GitHub. They also provide tutorials that show how to add custom plug-ins, so 
you can develop your own visualizations to add to their dashboards. 


Signing up for Freeboard.io 


For this example, register for a Freeboard.io 30-day trial at 


https://freeboard.io/signup 


Once you've registered, the My Freeboards page appears. If you’d like, you can click the 
Try a Tutorial button and visualize data from your smartphone. 


Creating a New Dashboard 


In the upper-right corner of the My Freeboards page, enter Sensor Dashboard in the 
enter a name field, then click the Create New button to create a dashboard. This displays 
the dashboard designer. 


Adding a Data Source 


If you add your data source(s) before designing your dashboard, you'll be able to configure 
each visualization as you add it: 


1. Under DATASOURCES, click ADD to specify a new data source. 


2. The DATASOURCE dialog's TYPE drop-down list shows the currently supported data 
sources, though you can develop plug-ins for new data sources as well.* Select PubNub. 
The web page for each PubNub sample live stream specifies the Channel and Subscribe 
key. Copy these values from PubNub's Sensor Network page at 
https://www.pubnub.com/developers/realtime-data-streams/sensor-network/, 
then insert their values in the corresponding DATASOURCE dialog fields. 
Provide a NAME for your data source, then click SAVE. 


*Some of the listed data sources are available only via Freeboard.io, not the 
open source Freeboard on GitHub. 


Adding a Pane for the Humidity Sensor 


A Freeboard.io dashboard is divided into panes that group visualizations. Multiple panes can 
be dragged to rearrange them. Click the + Add Pane button to add a new pane. Each pane 
can have a title. To set it, click the wrench icon on the pane, specify Humidity for the 
TITLE, then click SAVE. 


Adding a Gauge to the Humidity Pane 


To add a gauge to the Humidity pane, click the pane's + button, then select Gauge from the 
TYPE drop-down list. For the VALUE, select your data source and humidity, then click 
SAVE. 


Notice that the humidity value has four digits of precision to the right of the decimal point. 
Freeboard.io supports JavaScript expressions, so you can use them to perform calculations or 
format data. For example, you can use JavaScript's function Math.round to round the 
humidity value to the closest integer. To do so, hover the mouse over the gauge and click its 
wrench icon. Then, insert "Math.round(" before the text in the VALUE field and ")" after 
the text, then click SAVE. 
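For readers more at home in Python, note that JavaScript's Math.round and Python's built-in round treat halfway cases differently: Math.round rounds a .5 fraction toward positive infinity, while Python's round picks the nearest even integer. The sketch below approximates the JavaScript behavior with math.floor. The helper name js_style_round is hypothetical, and the approximation can differ from Math.round for a few values extremely close to .5 because of floating-point addition.

```python
import math


def js_style_round(value):
    """Approximate JavaScript's Math.round: halves round toward +infinity."""
    return math.floor(value + 0.5)


humidity = 80.5  # hypothetical sensor reading
print(js_style_round(humidity))  # the half rounds up
print(round(humidity))           # Python's round picks the even neighbor
```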





Adding a Sparkline to the Humidity Pane 


A sparkline is a line graph without axes that’s typically used to give you a sense of how a 
data value is changing over time. Add a sparkline for the humidity sensor by clicking the 
humidity pane’s + button, then selecting Sparkline from the TYPE drop-down list. For the 
VALUE, once again select your data source and humidity, then click SAVE. 


Completing the Dashboard 


Using the techniques above, add two more panes and drag them to the right of the first. 
Name them Radiation Level and Ambient Temperature, respectively, and configure 
each pane with a Gauge and Sparkline as shown above. For the Radiation Level gauge, 
specify Millirads/Hour for the UNITS and 400 for the MAXIMUM. For the Ambient 
Temperature gauge, specify Celsius for the UNITS and 50 for the MAXIMUM. 


16.8.3 Simulating an Internet-Connected Thermostat in Python 


Simulation is one of the most important applications of computers. We used simulation with 
dice rolling in earlier chapters. With IoT, it’s common to use simulators to test your 
applications, especially when you do not have access to actual devices and sensors while 
developing applications. Many cloud vendors have IoT simulation capabilities, such as IBM 
Watson IoT Platform and IOTIFY.io. 


Here, you'll create a script that simulates an Internet-connected thermostat publishing 
periodic JSON messages, called dweets, to dweet.io. The name "dweet" is based on 
"tweet": a dweet is like a tweet from a device. Many of today's Internet-connected security 
systems include temperature sensors that can issue low-temperature warnings before pipes 
freeze or high-temperature warnings to indicate there might be a fire. Our simulated sensor 
will send dweets containing a location and temperature, as well as low- and high-temperature 
notifications. These will be True only if the temperature reaches 3 degrees Celsius or 35 
degrees Celsius, respectively. In the next section, we'll use freeboard.io to create a simple 
dashboard that shows the temperature changes as the messages arrive, as well as warning 
lights for low- and high-temperature warnings. 


Installing Dweepy 


To publish messages to dweet.io from Python, first install the Dweepy library: 

    pip install dweepy 

The library is straightforward to use. You can view its documentation at: 


https://github.com/paddycarey/dweepy 


Invoking the simulator.py Script 


The Python script simulator.py that simulates our thermostat is located in the ch16 
example folder's iot subfolder. You invoke the simulator with two command-line arguments 
representing the number of total messages to simulate and the delay in seconds between 
sending dweets: 


ipython simulator.py 1000 1 
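The script reads its two arguments directly from sys.argv. A slightly more defensive variant, sketched here with the standard argparse module, reports a usage message when an argument is missing or is not an integer. The function name parse_simulator_args and the argument names messages and delay are assumptions made for this illustration; they are not part of simulator.py.

```python
import argparse


def parse_simulator_args(argv):
    """Return (number_of_messages, delay_seconds) parsed from argv."""
    parser = argparse.ArgumentParser(
        description='Simulated thermostat that publishes dweets')
    parser.add_argument('messages', type=int,
                        help='total number of messages to send')
    parser.add_argument('delay', type=int,
                        help='seconds to wait between messages')
    args = parser.parse_args(argv)
    return args.messages, args.delay


# same values as the command line: simulator.py 1000 1
number_of_messages, message_delay = parse_simulator_args(['1000', '1'])
print(number_of_messages, message_delay)
```

In a script you would pass sys.argv[1:] instead of the literal list used here.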


Sending Dweets 


The simulator.py script is shown below. It uses random-number generation and Python 
techniques that you've studied throughout this book, so we'll focus just on a few lines of code 
that publish messages to dweet.io via Dweepy. We've broken apart the script below for 
discussion purposes. 


By default, dweet.io is a public service, so any app can publish or subscribe to messages. 
When publishing messages, you'll want to specify a unique name for your device. 
We used 'temperature-simulator-deitel-python' (line 17).* Lines 18-21 define a 
Python dictionary, which will store the current sensor information. Dweepy will convert this 
into JSON when it sends the dweet. 


*To truly guarantee a unique name, dweet.io can create one for you. The Dweepy 
documentation explains how to do this. 
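To see roughly what such a message looks like as JSON, you can serialize the same dictionary yourself with the standard json module. This is only an approximation for illustration; how Dweepy performs the conversion internally is not shown in this section.

```python
import json

thermostat = {'Location': 'Boston, MA, USA',
              'Temperature': 20,
              'LowTempWarning': False,
              'HighTempWarning': False}

# Python's False becomes JSON's false; strings gain double quotes
payload = json.dumps(thermostat)
print(payload)

# decoding the JSON text yields an equal dictionary
assert json.loads(payload) == thermostat
```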


 1 # simulator.py 
 2 """A connected thermostat simulator that publishes JSON 
 3 messages to dweet.io""" 
 4 import dweepy 
 5 import sys 
 6 import time 
 7 import random 
 8 
 9 MIN_CELSIUS_TEMP = -25 
10 MAX_CELSIUS_TEMP = 45 
11 MAX_TEMP_CHANGE = 2 
12 
13 # get the number of messages to simulate and delay between them 
14 NUMBER_OF_MESSAGES = int(sys.argv[1]) 
15 MESSAGE_DELAY = int(sys.argv[2]) 
16 
17 dweeter = 'temperature-simulator-deitel-python'  # provide a unique name 
18 thermostat = {'Location': 'Boston, MA, USA', 
19               'Temperature': 20, 
20               'LowTempWarning': False, 
21               'HighTempWarning': False} 
22 


Lines 25-53 produce the number of simulated messages you specify. During each iteration of 
the loop, we 


• generate a random temperature change in the range -2 to +2 degrees and modify the 
temperature, 

• ensure that the temperature remains in the allowed range, 

• check whether the low- or high-temperature sensor has been triggered and update the 
thermostat dictionary accordingly, 

• display how many messages have been generated so far, 

• use Dweepy to send the message to dweet.io (line 52), and 

• use the time module's sleep function to wait the specified amount of time before 
generating another message. 


23 print('Temperature simulator starting') 
24 
25 for message in range(NUMBER_OF_MESSAGES): 
26     # generate a random number in the range -MAX_TEMP_CHANGE 
27     # through MAX_TEMP_CHANGE and add it to the current temperature 
28     thermostat['Temperature'] += random.randrange( 
29         -MAX_TEMP_CHANGE, MAX_TEMP_CHANGE + 1) 
30 
31     # ensure that the temperature stays within range 
32     if thermostat['Temperature'] < MIN_CELSIUS_TEMP: 
33         thermostat['Temperature'] = MIN_CELSIUS_TEMP 
34 
35     if thermostat['Temperature'] > MAX_CELSIUS_TEMP: 
36         thermostat['Temperature'] = MAX_CELSIUS_TEMP 
37 
38     # check for low temperature warning 
39     if thermostat['Temperature'] < 3: 
40         thermostat['LowTempWarning'] = True 
41     else: 
42         thermostat['LowTempWarning'] = False 
43 
44     # check for high temperature warning 
45     if thermostat['Temperature'] > 35: 
46         thermostat['HighTempWarning'] = True 
47     else: 
48         thermostat['HighTempWarning'] = False 
49 
50     # send the dweet to dweet.io via dweepy 
51     print(f'Messages sent: {message + 1}\r', end='') 
52     dweepy.dweet_for(dweeter, thermostat) 
53     time.sleep(MESSAGE_DELAY) 
54 
55 print('Temperature simulator finished') 


You do not need to register to use the service. On the first call to dweepy's dweet_for 
function to send a dweet (line 52), dweet.io creates the device name. The function 
receives as arguments the device name (dweeter) and a dictionary representing the message 
to send (thermostat). Once you execute the script, you can immediately begin tracking the 
messages on the dweet.io site by going to the following address in your web browser: 


https://dweet.io/follow/temperature-simulator-deitel-python 


If you use a different device name, replace "temperature-simulator-deitel-python" 
with the name you used. The web page contains two tabs. The Visual tab shows you the 
individual data items, displaying a sparkline for any numerical values. The Raw tab shows 
you the actual JSON messages that Dweepy sent to dweet.io. 
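Because the loop's temperature logic does not depend on the network, it can be pulled into a small function and exercised without sending any dweets. The sketch below is one possible refactoring and is not part of simulator.py. The function name next_reading and the constants LOW_WARNING and HIGH_WARNING are assumptions, with the thresholds taken from the 3 and 35 degree values described earlier. Treating the boundary values themselves as not triggering a warning (strict comparisons) is also an assumption of this sketch.

```python
import random

MIN_CELSIUS_TEMP = -25
MAX_CELSIUS_TEMP = 45
MAX_TEMP_CHANGE = 2
LOW_WARNING = 3    # assumed low-temperature threshold in degrees Celsius
HIGH_WARNING = 35  # assumed high-temperature threshold in degrees Celsius


def next_reading(thermostat, change):
    """Return a new thermostat dictionary after applying change degrees."""
    temperature = thermostat['Temperature'] + change
    # clamp the temperature to the allowed range
    temperature = max(MIN_CELSIUS_TEMP, min(MAX_CELSIUS_TEMP, temperature))
    return {**thermostat,
            'Temperature': temperature,
            'LowTempWarning': temperature < LOW_WARNING,
            'HighTempWarning': temperature > HIGH_WARNING}


reading = {'Location': 'Boston, MA, USA', 'Temperature': 20,
           'LowTempWarning': False, 'HighTempWarning': False}

for _ in range(5):
    change = random.randrange(-MAX_TEMP_CHANGE, MAX_TEMP_CHANGE + 1)
    reading = next_reading(reading, change)
    print(reading['Temperature'], reading['LowTempWarning'],
          reading['HighTempWarning'])
```

Returning a new dictionary rather than mutating the argument makes each step easy to test in isolation.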


16.8.4 Creating the Dashboard with Freeboard.io 


The sites dweet.io and freeboard.io are run by the same company. In the dweet.io 
webpage discussed in the preceding section, you can click the Create a Custom 
Dashboard button to open a new browser tab, with a default dashboard already 
implemented for the temperature sensor. By default, freeboard.io will configure a data 
source named Dweet and auto-generate a dashboard containing one pane for each value in 
the dweet JSON. Within each pane, a text widget will display the corresponding value as the 
messages arrive. 


If you prefer to create your own dashboard, you can use the steps in Section 16.8.2 to create a 
data source (this time selecting Dweet.io) and create new panes and widgets, or you can 
modify the auto-generated dashboard. 


Below are three screen captures of a dashboard consisting of four widgets: 


• A Gauge widget showing the current temperature. For this widget's VALUE setting, we 
selected the data source's Temperature field. We also set the UNITS to Celsius and 
the MINIMUM and MAXIMUM values to -25 and 45 degrees, respectively. 


• A Text widget to show the current temperature in Fahrenheit. For this widget, we set the 
INCLUDE SPARKLINE and ANIMATE VALUE CHANGES to YES. For this 
widget's VALUE setting, we again selected the data source's Temperature field, then 
added to the end of the VALUE field 


    * 9 / 5 + 32 


to perform a calculation that converts the Celsius temperature to Fahrenheit. We also 
specified Fahrenheit in the UNITS field. 


• Finally, we added two Indicator Light widgets. For the first Indicator Light's VALUE 
setting, we selected the data source's LowTempWarning field, set the TITLE to Freeze 
Warning and set the ON TEXT value to LOW TEMPERATURE WARNING. ON TEXT 
indicates the text to display when the value is true. For the second Indicator Light's 
VALUE setting, we selected the data source's HighTempWarning field, set the TITLE to 
High Temperature Warning and set the ON TEXT value to HIGH TEMPERATURE 
WARNING. 


[Figure: three screen captures of the dashboard] 
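The Celsius-to-Fahrenheit conversion used for the Text widget can be checked in Python. This is a plain sketch of the standard formula F = C × 9/5 + 32; celsius_to_fahrenheit is a hypothetical helper name, and the sample temperatures are the gauge's minimum, the simulator's starting value and the gauge's maximum.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a Celsius temperature to Fahrenheit."""
    return celsius * 9 / 5 + 32


for celsius in (-25, 20, 45):
    print(f'{celsius} C = {celsius_to_fahrenheit(celsius)} F')
```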
16.8.5 Creating a Python PubNub Subscriber 


PubNub provides the pubnub Python module for conveniently performing pub/sub 
operations. They also provide seven sample streams for you to experiment with: four 
real-time streams and three simulated streams:* 


*https://www.pubnub.com/developers/realtime-data-streams/. 


• Twitter Stream: provides up to 50 tweets-per-second from the Twitter live stream and 
does not require your Twitter credentials. 

• Hacker News Articles: this site's recent articles. 

• State Capital Weather: provides weather data for the U.S. state capitals. 

• Wikipedia Changes: a stream of Wikipedia edits. 

• Game State Sync: simulated data from a multiplayer game. 

• Sensor Network: simulated data from radiation, humidity, temperature and ambient 
light sensors. 

• Market Orders: simulated stock orders for five companies. 


In this section, you'll use the pubnub module to subscribe to their simulated Market Orders 
stream, then visualize the changing stock prices as a Seaborn barplot, like: 


[Figure: Seaborn barplot showing the latest stock price for each company] 


Of course, you also can publish messages to streams. For details, see the pubnub module's 
documentation at https://www.pubnub.com/docs/python/pubnub-python-sdk. 


To prepare for using PubNub in Python, execute the following command to install the latest 
version of the pubnub module. The '>=4.1.2' ensures that at a minimum the 4.1.2 version 
of the pubnub module will be installed: 

    pip install "pubnub>=4.1.2" 

The script stocklistener.py that subscribes to the stream and visualizes the stock prices 
is defined in the ch16 folder's pubnub subfolder. We break the script into pieces here for 
discussion purposes. 


Message Format 


The simulated Market Orders stream returns JSON objects containing five key-value pairs 
with the keys 'bid_price', 'order_quantity', 'symbol', 'timestamp' and 
'trade_type'. For this example, we'll use only the 'bid_price' and 'symbol'. The 
PubNub client returns the JSON data to you as a Python dictionary. 
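As an illustration of that format, the dictionary below mimics one decoded message. All of the values are invented for this sketch, including the form of the timestamp and the trade type, which the text does not specify; only the five key names come from the description above.

```python
# hypothetical decoded message; every value is made up for illustration
message = {'bid_price': 153.29,
           'order_quantity': 642,
           'symbol': 'Apple',
           'timestamp': 1548700000,
           'trade_type': 'limit'}

# this example needs only the company symbol and its bid price
symbol = message['symbol']
bid_price = message['bid_price']
print(f'{symbol}: {bid_price:.2f}')
```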


Importing the Libraries 


Lines 3-13 import the libraries used in this example. We discuss the PubNub types imported 
in lines 10-13 as we encounter them below. 


 1 # stocklistener.py 
 2 """Visualizing a PubNub live stream.""" 
 3 from matplotlib import animation 
 4 import matplotlib.pyplot as plt 
 5 import pandas as pd 
 6 import random 
 7 import seaborn as sns 
 8 import sys 
 9 
10 from pubnub.callbacks import SubscribeCallback 
11 from pubnub.enums import PNStatusCategory 
12 from pubnub.pnconfiguration import PNConfiguration 
13 from pubnub.pubnub import PubNub 
14 


List and DataFrame Used for Storing Company Names and Prices 


The list companies contains the names of the companies reported in the Market Orders 
stream, and the pandas DataFrame companies_df is where we'll store each company's 
last price. We'll use this DataFrame with Seaborn to display a bar chart. 


15 companies = ['Apple', 'Bespin Gas', 'Elerium', 'Google', 'Linen Cloth'] 
16 
17 # DataFrame to store last stock prices 
18 companies_df = pd.DataFrame( 
19     {'company': companies, 'price': [0, 0, 0, 0, 0]}) 
20 


Class SensorSubscriberCallback 


When you subscribe to a PubNub stream, you must add a listener that receives status 
notifications and messages from the channel. This is similar to the Tweepy listeners you’ve 
defined previously. To create your listener, you must define a subclass of 
SubscribeCallback (module pubnub.callbacks), which we discuss after the code: 


21 class SensorSubscriberCallback(SubscribeCallback): 
22     """SensorSubscriberCallback receives messages from PubNub.""" 
23     def __init__(self, df, limit=1000): 
24         """Create instance variables for tracking number of orders.""" 
25         self.df = df  # DataFrame to store last stock prices 
26         self.order_count = 0 
27         self.MAX_ORDERS = limit  # 1000 by default 
28         super().__init__()  # call superclass's init 
29 
30     def status(self, pubnub, status): 
31         if status.category == PNStatusCategory.PNConnectedCategory: 
32             print('Connected to PubNub') 
33         elif status.category == PNStatusCategory.PNAcknowledgmentCategory: 
34             print('Disconnected from PubNub') 
35 
36     def message(self, pubnub, message): 
37         symbol = message.message['symbol'] 
38         bid_price = message.message['bid_price'] 
39         print(symbol, bid_price) 
40         self.df.at[companies.index(symbol), 'price'] = bid_price 
41         self.order_count += 1 
42 
43         # if MAX_ORDERS is reached, unsubscribe from PubNub channel 
44         if self.order_count == self.MAX_ORDERS: 
45             pubnub.unsubscribe_all() 
46 
Class SensorSubscriberCallback's __init__ method stores the DataFrame in which 
each new stock price will be placed. The PubNub client calls overridden method status each 
time a new status message arrives. In this case, we're checking for the notifications that 
indicate that we've subscribed to or unsubscribed from a channel. 


The PubNub client calls overridden method message (lines 36-45) when a new message 
arrives from the channel. Lines 37 and 38 get the company name and price from the message, 
which we print so you can see that messages are arriving. Line 40 uses the DataFrame 
method at to locate the appropriate company's row and its 'price' column, then assign 
that element the new price. Once the order count reaches MAX_ORDERS, line 45 calls the 
PubNub client's unsubscribe_all method to unsubscribe from the channel. 
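The listener's bookkeeping can be tried without a PubNub connection. The following sketch is a simplified stand-in, not the pubnub module's API: FakeClient and PriceTracker are hypothetical classes, a plain dictionary replaces the DataFrame, and SimpleNamespace imitates an object with a message attribute like the one the listener receives.

```python
from types import SimpleNamespace


class FakeClient:
    """Hypothetical stand-in for the PubNub client object."""

    def __init__(self):
        self.subscribed = True

    def unsubscribe_all(self):
        self.subscribed = False


class PriceTracker:
    """Hypothetical listener that mirrors the message method's logic."""

    def __init__(self, companies, limit=1000):
        self.prices = {company: 0 for company in companies}
        self.order_count = 0
        self.max_orders = limit

    def message(self, client, message):
        payload = message.message  # the decoded JSON dictionary
        self.prices[payload['symbol']] = payload['bid_price']
        self.order_count += 1

        # stop listening once the requested number of orders has arrived
        if self.order_count == self.max_orders:
            client.unsubscribe_all()


client = FakeClient()
tracker = PriceTracker(['Apple', 'Google'], limit=2)

for symbol, price in [('Apple', 101.5), ('Google', 202.25), ('Apple', 99.0)]:
    if not client.subscribed:  # a real client would stop delivering messages
        break
    tracker.message(
        client, SimpleNamespace(message={'symbol': symbol, 'bid_price': price}))

print(tracker.prices, tracker.order_count)
```

With a limit of 2, the third message is never processed, which mirrors how the real listener stops receiving orders after unsubscribing.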


Function Update 


This example visualizes the stock prices using the animation techniques you learned in 
Chapter 6's Intro to Data Science section. Function update specifies how to draw one 
animation frame and is called repeatedly by the FuncAnimation we'll define shortly. We use 
Seaborn function barplot to visualize data from the companies_df DataFrame, using its 
'company' column values on the x-axis and 'price' column values on the y-axis. 


47 def update(frame_number): 
48     """Configures bar plot contents for each animation frame.""" 
49     plt.cla()  # clear old barplot 
50     axes = sns.barplot( 
51         data=companies_df, x='company', y='price', palette='cool') 
52     axes.set(xlabel='Company', ylabel='Price') 
53     plt.tight_layout() 
54 


Configuring the Figure 


In the main part of the script, we begin by setting the Seaborn plot style and creating the 
Figure object in which the barplot will be displayed: 


55 if __name__ == '__main__': 
56     sns.set_style('whitegrid')  # white background with gray grid lines 
57     figure = plt.figure('Stock Prices')  # Figure for animation 
58 


Configuring the FuncAnimation and Displaying the Window 


Next, we set up the FuncAnimation that calls function update, then call Matplotlib's show 
method to display the Figure. Normally, this method blocks the script from continuing until 
you close the Figure. Here, we pass the block=False keyword argument to allow the 
script to continue so we can configure the PubNub client and subscribe to a channel. 


59     # configure and start animation that calls function update 
60     stock_animation = animation.FuncAnimation( 
61         figure, update, repeat=False, interval=33) 
62     plt.show(block=False)  # display window 
63 


Configuring the PubNub Client 


Next, we configure the PubNub subscription key, which the PubNub client uses in 
combination with the channel name to subscribe to the channel. The key is specified as an 
attribute of the PNConfiguration object (module pubnub.pnconfiguration), which 
line 69 passes to the new PubNub client object (module pubnub.pubnub). Lines 70-72 
create the SensorSubscriberCallback object and pass it to the PubNub client's 
add_listener method to register it to receive messages from the channel. We use a 
command-line argument to specify the total number of messages to process. 


64     # set up pubnub-market-orders sensor stream key 
65     config = PNConfiguration() 
66     config.subscribe_key = 'sub-c-4377ab04-f100-11e3-bffd-02ee2ddab7fe' 
67 
68     # create PubNub client and register a SubscribeCallback 
69     pubnub = PubNub(config) 
70     pubnub.add_listener( 
71         SensorSubscriberCallback(df=companies_df, 
72             limit=int(sys.argv[1] if len(sys.argv) > 1 else 1000))) 
73 


Subscribing to the Channel 


The following statement completes the subscription process, indicating that we wish to 
receive messages from the channel named 'pubnub-market-orders'. The execute 
method starts the stream. 


74     # subscribe to pubnub-market-orders channel and begin streaming 
75     pubnub.subscribe().channels('pubnub-market-orders').execute() 
76 


Ensuring the Figure Remains on the Screen 


The second call to Matplotlib's show method ensures that the Figure remains on the screen 
until you close its window. 


77     plt.show()  # keeps graph on screen until you dismiss its window 


16.9 WRAP-UP 


In this chapter, we introduced big data, discussed how large data is getting and discussed 
hardware and software infrastructure for working with big data. We introduced traditional 
relational databases and Structured Query Language (SQL) and used the sqlite3 module to 
create and manipulate a books database in SQLite. We also demonstrated loading SQL query 
results into pandas DataFrames. 


We discussed the four major types of NoSQL databases (key-value, document, columnar and 
graph) and introduced NewSQL databases. We stored JSON tweet objects as documents in a 
cloud-based MongoDB Atlas cluster, then summarized them in an interactive visualization 
displayed on a Folium map. 


We introduced Hadoop and how it's used in big-data applications. You configured a 
multi-node Hadoop cluster using the Microsoft Azure HDInsight service, then created and executed 
a Hadoop MapReduce task using Hadoop streaming. 


We discussed Spark and how it’s used in high-performance, real-time big-data applications. 
You used Spark’s functional-style filter/map/reduce capabilities, first on a Jupyter Docker 
stack that runs locally on your own computer, then again using a Microsoft Azure HDInsight 
multi-node Spark cluster. Next, we introduced Spark streaming for processing data in 
mini-batches. As part of that example, we used Spark SQL to query data stored in Spark 
DataFrames. 


The chapter concluded with an introduction to the Internet of Things (IoT) and the 
publish/subscribe model. You used Freeboard.io to create a dashboard visualization of a live 
sample stream from PubNub. You simulated an Internet-connected thermostat which 
published messages to the free dweet.io service using the Python module Dweepy, then 
used Freeboard.io to visualize the simulated device’s data. Finally, you subscribed to a 
PubNub sample live stream using their Python module. 


Thanks for reading Python for Programmers. We hope that you enjoyed the book and that 
you found it entertaining and informative. Most of all we hope you feel empowered to apply 
the technologies you've learned to the challenges you'll face in your career. 